Possible CRON Issue, Routing Dies @ Top Of Each Hour

stephenw10

Check the system log when it happens. I'd expect to see something if the firewall stops passing traffic.

Do you see the outage on multiple connected clients? At the same time?

Steve

House Of Cards

@stephenw10 I'll have to re-enable writing system logs to disk. I had read that log rotation could cause some restarted services, so I turned it off to rule it out, and because it's better for the SSD.

The fact is, I looked through all the logs when I was keeping them, but never saw anything that gave me, with my limited knowledge, a clue that something was wrong. For example, there wouldn't even be any entries anywhere during the few minutes it was down, or right before/after. Like it had nothing going on. On the dashboard widget I do notice, if it's open at the time, that my states jump up when it's happening and then drop slowly back down. But if the dashboard isn't open I can't even get to the GUI. It's very unresponsive (GUI) when the system hangs. The dashboard will even time out if I'm already in...

It's as though the whole system hangs, and the states reload. Even though no feed updates or anything are occurring.

And the firewall doesn't stop passing traffic. If I'm streaming something from my media server, for example, playback continues uninterrupted. But if I exit the stream and go back to the server, things won't load or update. So I can't play something new. Like the states aren't there and new connections aren't happening, or are happening really slowly.

And yes, it happens on everything when it happens, whether it's an internal connection, or something over the WAN...

stephenw10

It could be something else on the network flooding it and then getting blocked. Perhaps a loop that gets created somehow.

If the states spike, it's not something blocking traffic. Connections are arriving at the firewall and opening states.

It could be hosts attempting to open connections multiple times when they fail. If those are opening states on WAN it also could be something upstream failing to pass traffic.

House Of Cards

@stephenw10 I may have jumped the gun on states being the culprit.

I've been watching them closely at the top of the hour, and they don't actually jump up there when the system becomes unresponsive. They do jump once in a while when I look at the history, but it's not coinciding with the unresponsiveness. I must have seen it happen as a coincidence, and thought that was happening each time.

I run 2500-4000 states at normal operation. Once in a while, I see that jump to 15000, but I haven't investigated what is causing that. Could be a reload of some IP list or something.

What I did notice is that if I watch the hardware at the top of each hour, the hard drive goes active and thrashes around writing for 1-2 full minutes. It's at that point when everything hangs. Watching the GUI doesn't show anything abnormal as far as CPU usage, or drastic memory changes... But it's reading/writing like crazy when the unresponsiveness happens.

The logs are pointless in the GUI. They don't show a thing going on to tell me what is happening when this occurs. Only "warning" I saw anywhere was that it took 4+ seconds to write the data to disk from vnstatd (Traffic Totals). But it wasn't at the time of the unresponsiveness.

So I'm lost... I'm going to upgrade to 24.03 which just came out today and see if it changes. In the meantime, any idea how to check the disk usage that's happening? Any way to see what is doing the disk thrashing?

Thanks!

stephenw10

Try running: top -HaSP -m io

Like:

last pid: 83078;  load averages:  0.23,  0.27,  0.25                                                                     up 0+07:03:24  23:21:55
290 threads:   5 running, 259 sleeping, 26 waiting
CPU 0:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 1:  0.0% user,  0.0% nice,  6.2% system,  0.0% interrupt, 93.8% idle
CPU 2:  0.8% user,  0.0% nice,  0.4% system,  0.0% interrupt, 98.8% idle
CPU 3:  0.0% user,  0.0% nice,  0.7% system,  0.0% interrupt, 99.3% idle
Mem: 33M Active, 195M Inact, 384M Wired, 7238M Free
ARC: 108M Total, 28M MFU, 76M MRU, 412K Anon, 702K Header, 2812K Other
     81M Compressed, 182M Uncompressed, 2.24:1 Ratio
Swap: 1024M Total, 1024M Free

  PID USERNAME     VCSW  IVCSW   READ  WRITE  FAULT  TOTAL PERCENT COMMAND
   11 root          12     36      0      0      0      0   0.00% [idle{idle: cpu0}]
   11 root          17    139      0      0      0      0   0.00% [idle{idle: cpu3}]
   11 root         156     20      0      0      0      0   0.00% [idle{idle: cpu2}]
   11 root          20     25      0      0      0      0   0.00% [idle{idle: cpu1}]
    0 root           2      0      0      0      0      0   0.00% [kernel{e6000sw0 taskq}]
   12 root         120      0      0      0      0      0   0.00% [intr{swi0: uart uart}]
76247 root         122      0      0      0      0      0   0.00% top -HaSP -m io
    7 root          19      0      0      0      0      0   0.00% [pf purge]
64885 root          20      0      0      0      0      0   0.00% /usr/sbin/powerd -b hadp -a hadp -n hadp
    0 root           4      0      0      0      0      0   0.00% [kernel{if_config_tqg_0}]
32001 root          14      0      0      0      0      0   0.00% /bin/sh /root/7100_fan.sh
    2 root          22      0      0      0      0      0   0.00% [clock{clock (0)}]

House Of Cards

@stephenw10 Does this need to be done using SSH or something? I don't have that set up, and if I put this into the command line tool of the GUI, I only get this...

last pid: 73517;  load averages:  0.77,  0.87,  1.05  up 0+13:51:11    12:33:22
324 threads:   4 running, 303 sleeping, 17 waiting
CPU 0: 10.9% user,  0.8% nice,  3.2% system,  0.2% interrupt, 84.8% idle
CPU 1:  9.2% user,  0.8% nice,  1.9% system,  0.1% interrupt, 88.0% idle
Mem: 646M Active, 2845M Inact, 1498M Laundry, 1407M Wired, 434M Buf, 1716M Free
Swap: 3881M Total, 1805M Used, 2077M Free, 46% Inuse

There isn't anything else...

stephenw10

Yes, that's interactive, it can't be run from the gui command prompt. The Diag > System Activity page shows the top output but in CPU mode not IO.
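
If you do enable SSH or use the console, a rough sketch like the one below could capture the I/O view to a file around the top of the hour instead of watching it live. It assumes the FreeBSD top(1) batch options (-b for batch mode, -s for the delay between displays, -d for the number of displays) behave as documented on your version. The function names and the /root/top_io.log path are made up for illustration, and I haven't run this on your hardware. The first display in I/O mode may show little or nothing, since the counters are deltas between samples.

```sh
#!/bin/sh
# Sketch only: helper functions for logging "top -m io" output near the hour.
# Assumptions: FreeBSD top(1) with -b/-s/-d; log path below is hypothetical.

# Print the number of seconds from a Unix timestamp ($1) to the next full hour.
# Works on whole-hour UTC boundaries, which match local hours in most time zones.
secs_to_hour() {
    echo $(( 3600 - $1 % 3600 ))
}

# Append batch-mode I/O snapshots to a log file.
# $1 = number of displays (default 60), $2 = seconds between them (default 5),
# $3 = log file (default /root/top_io.log, a hypothetical location).
capture_io() {
    count=${1:-60}
    delay=${2:-5}
    log=${3:-/root/top_io.log}
    top -b -HaSP -m io -s "$delay" -d "$count" >> "$log"
}
```

For example, after loading the functions in a shell, running sleep $(( $(secs_to_hour "$(date +%s)") - 60 )) && capture_io 60 5 would start logging about a minute before the hour (if the hour is more than a minute away) and keep sampling for roughly five minutes. The log can then be checked for whichever process has the largest READ/WRITE counts during the hang.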

House Of Cards

@stephenw10 I may have found the culprit...

I shut down NTOPNG, and the last time I remembered to check at the top of the hour, I didn't see any sluggishness. I'll watch with it disabled and see if that was the issue.

Is NTOPNG known to do things like this, or is this a bug? I can't be certain, but I started noticing these periods of things not loading around the same timeframe I did the switch from CE to Plus. This was never an issue for years like this... And I'm not sure if this would be an issue with NTOPNG or PFSENSE. But if I can confirm this was the cause of the loss in connectivity, I'm happy to submit whatever you guys need to determine why this would happen. It could be affecting others.

On the other hand, my system is an older machine. Maybe I'm just asking too much of it? This is a home environment, and the information in NTOPNG is fun to look at, but I don't really need it running... My hardware should be more than capable for a modest firewall with limited workload though.

I'll report back if this was the issue...

stephenw10

Hmm, ntopng does generate a lot of logging and can use significant CPU but I've never heard of it stopping routing like that.

Is it a very slow disk?

House Of Cards

@stephenw10 No, not a slow disk... It's a SATA III SSD... It was a budget build, but more than adequate for what I'm doing.

https://www.newegg.com/kingston-a400-120gb/p/N82E16820242399?Item=N82E16820242399

This does seem to have stopped the issue, so I'm going to wipe/reinstall NTOPNG, and leave it disabled. I can enable it if I want to do some troubleshooting, but in a home environment, it doesn't justify the wear and tear on the SSD to run all the time.

If I can do anything to help troubleshoot why it would kill the browsing altogether, let me know. I appreciate the insight into figuring this out.