Something taking up all the space on my system

SteveITS

@troutpocket I vaguely remember a similar post but only vaguely. Do you have log compression on? If so try disabling it.

Troutpocket

@steveits

Yes. It's set for bzip2. I'll turn it off for the night and see what's happening in the morning.

Troutpocket

@steveits

No dice. It's 109% full again. What else could it be?

Troutpocket

Here's the fsck during single user:

Forcing filesystem check (5 times)...
** /dev/ufsid/5b0df80d6c33863e
** Last Mounted on /
** Root file system
** Phase 1 - Check Blocks and Sizes
uhub0: 8 ports with 8 removable, self powered
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
36366 files, 638029 used, 6213394 free (4610 frags, 776098 blocks, 0.1% fragmentation)

***** FILE SYSTEM IS CLEAN *****
** /dev/ufsid/5b0df80d6c33863e
** Last Mounted on /
** Root file system
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
36366 files, 638029 used, 6213394 free (4610 frags, 776098 blocks, 0.1% fragmentation)

***** FILE SYSTEM IS CLEAN *****
** /dev/ufsid/5b0df80d6c33863e
** Last Mounted on /
** Root file system
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
36366 files, 638029 used, 6213394 free (4610 frags, 776098 blocks, 0.1% fragmentation)

***** FILE SYSTEM IS CLEAN *****
** /dev/ufsid/5b0df80d6c33863e
** Last Mounted on /
** Root file system
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
36366 files, 638029 used, 6213394 free (4610 frags, 776098 blocks, 0.1% fragmentation)

***** FILE SYSTEM IS CLEAN *****
** /dev/ufsid/5b0df80d6c33863e
** Last Mounted on /
** Root file system
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
36366 files, 638029 used, 6213394 free (4610 frags, 776098 blocks, 0.1% fragmentation)

***** FILE SYSTEM IS CLEAN *****
/dev/ufsid/5b0df80d6c33863e: FILE SYSTEM CLEAN; SKIPPING CHECKS
/dev/ufsid/5b0df80d6c33863e: clean, 6213394 free (4610 frags, 776098 blocks, 0.1% fragmentation)
Filesystems are clean, continuing...
Mounting filesystems...

stephenw10

And after running that it's back down to the expected usage?

Troutpocket

@stephenw10

When it reaches 100% full I reboot and do the fsck. It comes back with no config so I restore a good config from backup, reboot again, and the system is back to normal, but slowly filling up with invisible stuff.

stephenw10

@troutpocket said in Something taking up all the space on my system:

It comes back with no config

Hmm, well that's odd. There have been bugs in the past where the config file gets updated with bad data and grows exponentially. Do you see no config file at all in /conf? Or in /conf/backup?

Troutpocket

@stephenw10

After the reboot, the config.xml file is a fresh 8k file. /conf/backup is full of my backup configs, plus I have one off-line I can use. Everything looks good and healthy otherwise. Restoring the good config brings things back to "normal".

stephenw10

Hmm, maybe check the config file size periodically. Make sure it's not increasing before this happens.
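A minimal sketch of one way to do that periodic check, assuming a plain `sh` prompt on the firewall. The helper names `file_size` and `log_size` and the log path in the usage comment are made up for illustration; only `/conf/config.xml` comes from this thread.

```shell
#!/bin/sh
# Hypothetical helper: print the size of a file in bytes.
# wc -c reads the whole file, but behaves the same on FreeBSD and Linux.
file_size() {
    wc -c < "$1" | tr -d ' '
}

# Hypothetical helper: append "<epoch seconds> <bytes>" for file $1 to log $2.
log_size() {
    printf '%s %s\n' "$(date +%s)" "$(file_size "$1")" >> "$2"
}

# Example: sample the config every 5 minutes (the log path is an assumption):
#   while :; do log_size /conf/config.xml /tmp/config-size.log; sleep 300; done
```

If the second column of the log stays flat while `df` keeps climbing, the config file can be ruled out.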

Troutpocket

@stephenw10

I did. It's not changing. There isn't any file or folder I can find that is dramatically increasing in size. Basically, 24GB is steadily growing on the root filesystem in some way not generally visible to regular filesystem tools. I have a good graph from grafana that I'll post later which helps visualize the linear growth.
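One way to put a number on "not visible to regular filesystem tools" is to compare what the filesystem says is used (`df`) with what a walk of the directory tree can actually find (`du`). The sketch below is only an illustration; the function names are invented here, and a small gap is normal because of filesystem metadata.

```shell
#!/bin/sh
# Kilobytes the filesystem holding $1 reports as used (1K blocks).
fs_used_kb() {
    df -kP "$1" | awk 'NR == 2 { print $3 }'
}

# Kilobytes reachable by walking the tree at $1 without crossing
# into other mounted filesystems (-x).
tree_used_kb() {
    du -skx "$1" 2>/dev/null | awk '{ print $1 }'
}

# Difference between the two figures, in kilobytes.
gap_kb() {
    echo $(( $1 - $2 ))
}

# Example (run as root so du can read every directory):
#   gap_kb "$(fs_used_kb /)" "$(tree_used_kb /)"
```

A gap that keeps growing by gigabytes suggests the space belongs to something with no name in the tree, for example a file that was deleted while a process still had it open.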

Troutpocket

@stephenw10

Here's the last 48 hours. It gracefully fills up until about 30% then there's this weird jaggy thing. Maybe syslog is attempting to trim logs?

The graph goes back to zero when it's 100% full, probably because it can't send telegraf data to the logger any more. Then I reboot and fsck and we start again. This trend goes back at least a month. I don't keep logs like this longer so I'm not sure when it started.

[Image: Grafana graph of disk usage over the last 48 hours]

stephenw10

How are you pulling that data? I assume it lines up with the output from df at that time?

It's not something I've seen locally where there was no obvious process filling the filesystem.

Troutpocket

@stephenw10

Telegraf dumps timeseries data from the pfsense firewall to a separate "logger" system (influxdb). It's not stored locally. We do this on 50+ pfsense firewalls and it's not happening anywhere else. I've been comparing configs across multiple sites and they're all nearly identical. I bang these out a few times each month.

I guess at this point it has become an academic curiosity for me more than anything else. I can fail over to the other half of the HA pair (yay CARP!).

SteveITS

@troutpocket it’s HA? Does it happen on the backup if you make that master?

Troutpocket

@steveits
Nope!

I think I tracked something down... I decided to stop Suricata and see what happens. I watched the graph for a bit and the disk space usage leveled out.

The graph stopped going up when I stopped Suricata. Dropped like off a cliff when I uninstalled it. Remained level after that.

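[Image: Grafana graph of disk usage leveling off after Suricata was stopped, then dropping after it was uninstalled]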

It's like the suricata logs are growing at a normal rate (and truncating at 5GB as configured), but the filesystem thinks they're still growing.

SteveITS

@troutpocket but if it doesn’t happen on the backup, I’d think it’s therefore not a configuration issue?

Troutpocket

@steveits

I blew away suricata and reinstalled/reconfigured it. I'm watching the system closely and will report back in 24 hrs.

Troutpocket

@troutpocket

So far so good. Looks like something about suricata was causing this weird invisible filesystem creep.

bmeeks

@troutpocket said in Something taking up all the space on my system:

@troutpocket

So far so good. Looks like something about suricata was causing this weird invisible filesystem creep.

Probably a zombie Suricata process. Certain combinations of unusual events can result in more than one Suricata instance running on the same interface, and that leads to weird troubles. One of those instances could easily have been holding a log file open after it was deleted. Such a file is invisible in directory listings but keeps consuming space until the process exits.
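For anyone unfamiliar with how an open file can be invisible, here is a small self-contained demonstration of the general Unix behavior. It does not touch Suricata; the function name and the temporary file are made up for the example.

```shell
#!/bin/sh
# Demonstrates that a process can keep writing to a file after its name is removed.
demo_unlinked() {
    dir=$(mktemp -d) || return 1
    exec 3> "$dir/held.log"        # open the file on descriptor 3
    rm "$dir/held.log"             # unlink it: the name is gone, the inode is not
    printf 'still growing\n' >&3   # writes still succeed and still use disk blocks
    remaining=$(ls -A "$dir" | wc -l | tr -d ' ')
    exec 3>&-                      # closing the descriptor finally frees the space
    rmdir "$dir"
    echo "$remaining"              # number of names left in the directory
}

demo_unlinked
```

The function prints 0: nothing is listed in the directory even though the write went through. Space held this way is counted by `df` but not found by `du`, and it is released only when the process closes the file or exits.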

In the future, to see if duplicate Suricata processes are running, execute this command from a CLI prompt (directly on the console or via SSH):

ps -ax | grep suricata

You should see only one Suricata process ID (PID) per interface. If you see duplicate entries listed, you will need to kill them. Best to stop all instances using the GUI on the INTERFACES tab in Suricata, then run that CLI command again and manually kill any remaining process IDs. Then return to the INTERFACES tab in the GUI and start Suricata again on the configured interfaces.
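The duplicate check can be scripted roughly as below. This is a sketch that assumes each Suricata instance's command line carries the interface as `-i <interface>`; if an instance is started differently, the filter would need adjusting. The function name is invented for the example.

```shell
#!/bin/sh
# Reads process command lines on stdin and prints every interface that is
# named after "-i" on more than one suricata command line.
duplicate_ifaces() {
    awk '/suricata/ && $0 !~ /grep|awk/ {
             for (i = 1; i < NF; i++)
                 if ($i == "-i") count[$(i + 1)]++
         }
         END { for (ifc in count) if (count[ifc] > 1) print ifc }'
}

# Example:
#   ps -axww -o command | duplicate_ifaces
```

No output means no interface has more than one instance. Any interface name that is printed has duplicates that should be stopped and cleaned up as described above.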

Troutpocket

@bmeeks
Even after rebooting the firewall?