DHCP config is apparently not updated in a safe fashion

tenortim

I ran into a nasty issue today where ntopng filled the drive on my pfSense router (I've posted about that in the ntopng forum).
The nasty part was that this only became apparent when I was deleting a static dhcp definition for a laptop that no longer exists, and the result was that it destroyed most of my dhcp configuration (the first 18 entries were all that were left).
The only safe way to perform an update is to write a new file (flush it and verify it was fully written) and then atomically move it into place. It seems that is not happening for the DHCP config update.
I have backups, but it would be really great if the update code were more careful/paranoid.
Should I open an issue?

Another thought. Given that the consumption of space happened over an extended period, is there any way to get email alerts when utilization goes over a threshold (e.g. 90%)?

Tim

KOM

You probably won't get much action on the dhcp server issue here. Perhaps upstream at FreeBSD?

As for alerts, I run the Zabbix network monitor and use the Zabbix agent package on pfSense for system metrics etc.

tenortim

@KOM It's not the upstream server that's at fault, because that code never changes the config; it's the way the UI/PHP code updates the config. The config itself was trashed because it wasn't updated in a way that's safe if the disk is full. Does that make sense?

KOM

Makes sense now. I thought you were complaining about the DHCP service in FreeBSD. Open an issue on Redmine if you like.

tenortim

Thanks @KOM. Will do. I switched over to my old system temporarily, so I can spend a bit more time looking at which files got damaged compared to my backup from last week.

KOM

pfSense supports the autobackup feature, which saves the last n copies of your config. You might have been able to roll back to a previous config via Diagnostics - Backup & Restore - Config History.

Gertjan

@tenortim said in DHCP config is apparently not updated in a safe fashion:

and the result was that it destroyed most of my dhcp configuration

Your disk ran out of space.
This was logged I guess.
pfSense, like any other device with an OS, will carry on until the bitter end.

In your case, things went wrong when the dhcp.conf file was rewritten.
Next time it could be the pfSense config.xml file - or any other config file based on config.xml (a couple of hundred).

Your dhcp.conf file was probably written correctly - as far as PHP can check - but the underlying OS died when it was closing the file. These actions are being done in parallel - your file system becomes 'dirty' and not-closed files are at risk.

pfSense itself logs to circular log files, because it's known to run on limited RAM/disk machines.
Packages that are based on 'tracking' should log to a dedicated destination (a syslog server, some NAS, wherever), because if something goes wrong and pfSense dies, it takes your logs with it: the same logs that could explain, post-mortem, what actually happened.

See https://www.test-domaine.fr/munin/brit-hotel-fumel.net/index.html for an example. I do receive alert mails if some values go over a predefined limit.

tenortim

@KOM, yes I have backups, and, once I manually deleted the 9.2GB of rrd logs that ntopng had generated going back over a year, restoring the backup is easy.

@Gertjan, no, that is not what happened. The OS didn't die. It was just fine. The system was up, just no longer operating correctly. What I think happened (I haven't looked at the code yet) was that the UI tried to overwrite the file in place and this failed partway for lack of space, leaving a truncated file.

On a POSIX-compliant system (such as FreeBSD), it is entirely possible to do this in a safe way:

1) Create a new temporary file.
2) Write the contents, checking the error return from write().
3) Call fsync() on the file and check the error return.
4) Close the file and check the error return.

If all of the above succeeds, you now have the new config written to stable storage and even if the OS crashes, that data is safe.

Finally, 5) rename the temp file over the config file.
Again, POSIX guarantees that rename() is atomic and that either the original file or the replacement file will exist regardless of whether the system crashes at any point during the rename call.
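
The steps above can be sketched roughly as follows. This is only an illustration in Python, not the pfSense code (which is PHP and which I haven't read); the function name atomic_write is made up for the example, and it assumes the temp file can be created in the same directory as the target so that the rename stays on one filesystem.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace `path` with `data` (bytes), or leave it untouched on failure.

    Hypothetical helper for illustration only.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # 1) Create the temp file next to the target, so the final rename
    #    never has to cross a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)         # 2) a failed write raises (e.g. ENOSPC)
            tmp.flush()             #    push Python's buffer to the OS
            os.fsync(tmp.fileno())  # 3) ask the OS to commit to stable storage
        # 4) leaving the with-block closed the file; a close error raises too
        os.replace(tmp_path, path)  # 5) rename over the old config
    except BaseException:
        # Any failure: discard the partial temp file, keep the old config.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

With this shape, a full disk produces an exception and a deleted temp file rather than a truncated config. Two caveats on the sketch: mkstemp creates the file with mode 0600, so real code would need to restore the original ownership and permissions, and code that is paranoid about durability would also fsync the containing directory after the rename.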

If we're already doing that, then there's an OS/filesystem bug. If we're not, then we're not updating safely and are susceptible to failure if/when the filesystem fills.

KOM

For what it's worth, I have NEVER seen an operating system gracefully handle a full system disk. Not one. They all hang, barf or choke in one way or another.

tenortim

@KOM said in DHCP config is apparently not updated in a safe fashion:

For what it's worth, I have NEVER seen an operating system gracefully handle a full system disk. Not one. They all hang, barf or choke in one way or another.

I would agree with you there with the fine distinction that the kernel handles it just fine, but generally, userspace doesn't do so well. But hanging would have been benign. Truncating critical files, less so. And I'm painfully aware just how little userspace code is written to be highly resilient/safe in the face of errors (it's tedious and painful to do).

KOM

I use Zabbix to monitor my infrastructure, and pfSense has Zabbix packages. It would notify you if the disk got below 10% free space, for instance. Not exactly an ideal fix for your issue but at least you would know you were getting close to full before a major incident happened.
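
For anyone who wants a stopgap before setting up a monitoring system, the check itself is simple. Here is a minimal sketch in Python, assuming an interpreter is available on the box; it is not how Zabbix implements its check, the function names are made up, and the 90% default is just the example threshold from the first post. Sending the actual alert (mail, syslog, etc.) is left out.

```python
import shutil


def percent_used(path="/"):
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report a size of zero; treat as empty.
        return 0.0
    return 100.0 * usage.used / usage.total


def over_threshold(path="/", threshold=90.0):
    """True when utilization of `path` is at or above `threshold` percent."""
    return percent_used(path) >= threshold


if __name__ == "__main__":
    if over_threshold("/"):
        print("WARNING: / is %.1f%% full" % percent_used("/"))
```

Run from cron, the printed warning could be piped to whatever notification mechanism you already have.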

tenortim

@KOM Thanks. That's a really great suggestion. We use Zabbix to monitor our infrastructure at my day job, and the infrastructure team seem happy with it. Time to roll it out at home!
