NTP server issue in PFSense 2.7.2 ?

DBMandrake

Hi All,

Since 2.6.0 I've been using the NTP server on PFSense - mainly to provide an NTP server on our CCTV VLAN so that IP cameras have an NTP server available without allowing them any internet access. The service is also available on a couple of other internal VLANs.

This worked well for a long time, but recently I've noticed it has not been reliable, although I could not pin anything specific down.

Unrelated to this I tried to set up NTP service monitoring in LibreNMS to monitor the NTP server in PFSense (using the Nagios check_ntp_time plugin) and found it was unable to query PFSense's NTP server at all, with a strange error:

./check_ntp_time -H 10.0.1.254
NTP CRITICAL: Offset unknown|

After browsing a few discussion threads here I found people reporting similar issues beginning around the release of 2.7.2, but none that seemed a close enough match that I thought it was appropriate to piggyback on the thread.

On a hunch I tried disabling "Kiss of Death" packets in the "Default access restrictions" and presto, it works now, going from 100% failure rate for queries to 100% success:

./check_ntp_time -H 10.0.1.254
NTP OK: Offset 0.0009708702564 secs|offset=0.000971s;60.000000;120.000000;

What I don't understand is why this change is necessary in 2.7.2 though - hence this thread.

I only have a surface level understanding of the NTP protocol but from what I've read the Kiss of Death packet is a way for the server to tell a client to stop querying it as a form of rate limiting ?
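For readers who want to see what a Kiss of Death reply looks like on the wire, here is a minimal sketch. It relies on the NTPv4 packet layout from RFC 5905: a KoD reply carries stratum 0 in byte 1 and a four-character ASCII "kiss code" such as RATE, DENY or RSTR in the reference ID field (bytes 12 to 15). The function name `parse_kiss_code` is made up for this illustration and is not part of pfSense, ntpd, or the Nagios plugin.

```python
def parse_kiss_code(packet: bytes):
    """Return the kiss code (e.g. 'RATE') if the reply is a KoD packet, else None.

    Assumes `packet` is a raw NTP reply of at least 48 bytes (the fixed header).
    """
    if len(packet) < 48:
        raise ValueError("an NTP packet has a 48-byte fixed header")
    stratum = packet[1]
    if stratum != 0:
        # A normal reply from a synchronised server has stratum 1..15.
        return None
    # In a stratum-0 reply the reference ID field holds the ASCII kiss code.
    return packet[12:16].decode("ascii", errors="replace")


def make_fake_reply(stratum: int, refid: bytes) -> bytes:
    """Build a synthetic 48-byte reply for demonstration purposes only."""
    pkt = bytearray(48)
    pkt[0] = 0xE4 if stratum == 0 else 0x24  # LI/VN=4/Mode=4 (server)
    pkt[1] = stratum
    pkt[12:16] = refid
    return bytes(pkt)
```

A client that honours KoD is expected to slow down or stop when it sees a code like RATE; a one-shot checker has no later polls to slow down, so it can only report a failure.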

Great - but why is it causing all requests to fail even if I only manually trigger a request once in 5 minutes for example ? Is this a bug ?

CCTV cameras typically default to NTP queries every 10 minutes, and my LibreNMS plugin has to query every 5 minutes as part of its function to alert if the service is down, so frequent queries from clients are to be expected in my use case.

I have no desire to rate limit queries to the NTP server. Due to firewall rules it is not publicly accessible, so only internal LAN clients on a few specific VLANs can send it queries, and I want all queries answered reliably. So am I OK to just disable Kiss of Death, or is there an underlying problem I should be investigating ?

Anyone else experienced this problem and found this solution / workaround ?

For an NTP server on a firewall like PFSense which is usually going to be serving only LAN clients, should Kiss of Death actually default to off in the configuration ? It's a very different use case than a publicly accessible server which is serving unknown clients across the internet.

NollipfSense

@DBMandrake said in NTP server issue in PFSense 2.7.2 ?:

so am I OK to just disable Kiss of Death,

I would...if just to see whether it resolves your issue.

DBMandrake

@NollipfSense Well, so far it does resolve the issue.

But I'd like to get a better understanding of why changing one of the default settings is seemingly necessary to have it working at all in 2.7.2.

I'm pretty sure this wasn't an issue in older versions of PFSense, and I have certainly never customised this setting before, so I'm guessing that something changed either in the default configuration or the underlying NTP daemon in one of the updates between 2.6.0 and 2.7.2. (I have upgraded through every intermediate version)

Better understanding of the underlying issue is also helpful for anyone else searching for an answer to this problem who comes across this thread to know whether my suggestion is a good idea or just a workaround.

stephenw10

Hmm, nothing has changed with regard to that setting between 2.6 and 2.7.2. You can check the generated file in /var/etc/ntpd.conf.

DBMandrake

@stephenw10 New binary with slightly different behaviour due to the new FreeBSD base version ?

The ntpd.conf doesn't show anything particularly interesting - the kod option is present or absent in the config file depending on the GUI choice as would be expected.
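To compare two generated configs quickly (for example one saved with the GUI option on and one with it off), a small script can list the flags on each restrict line. This is only a sketch: the sample text below is a hypothetical excerpt, not a copy of what pfSense writes to /var/etc/ntpd.conf, and the address 10.0.1.50 is an invented monitoring host. As far as the ntpd documentation describes it, `kod` only has an effect together with `limited`, so both flags are worth checking.

```python
# Hypothetical excerpt for illustration; a real ntpd.conf will differ.
SAMPLE = """
# access restrictions
restrict default kod limited nomodify nopeer notrap
restrict -6 default kod limited nomodify nopeer notrap
restrict 10.0.1.50 mask 255.255.255.255 nomodify
"""


def restrict_flags(conf_text):
    """Map each 'restrict' target to the set of flags found on its line."""
    result = {}
    for raw in conf_text.splitlines():
        tokens = raw.split("#", 1)[0].split()  # drop comments, then tokenise
        if len(tokens) < 2 or tokens[0] != "restrict":
            continue
        tokens = tokens[1:]
        family = ""
        if tokens[0] in ("-4", "-6"):  # optional address-family selector
            family = tokens.pop(0) + " "
        if not tokens:
            continue
        target = family + tokens.pop(0)
        if len(tokens) >= 2 and tokens[0] == "mask":
            tokens = tokens[2:]  # skip 'mask <netmask>'
        result[target] = set(tokens)
    return result


def kod_active(flags):
    """True when both 'kod' and 'limited' are present on a restrict line."""
    return "kod" in flags and "limited" in flags
```

Running `restrict_flags()` on the real file contents from two systems and comparing the dictionaries would show whether the generated defaults actually differ between versions.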

I don't have a 2.6.0 system handy to see what the default setting for kod is when NTP is installed, or what ntpd.conf is generated however - is it possible the default setting for kod has changed at some point ?

stephenw10

It's possible the kod option is interpreted differently.

serbus

@DBMandrake

Hello!

Maybe related to https://github.com/nagios-plugins/nagios-plugins/issues/687

John

DBMandrake

@serbus Hi,

Related, yes - they are seeing the same problem I was.

I don't agree with their conclusion that there is a "bug" in the Nagios plugin though.

The whole purpose of the plugin is to be run once every 5 minutes during a Nagios (or in my case LibreNMS) polling session to confirm whether the NTP server is responsive or not.

It is not a "real" NTP client. It just sends a static NTP client request and checks to see if an answer is received. It does not keep any internal state, so it would not know how to "back off" in response to a KOD packet.
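To make "a static NTP client request" concrete, here is a rough sketch of what a stateless checker does, based on the NTPv4 header layout in RFC 5905. It is not the plugin's actual code; the function names are invented for this example. The request is a 48-byte packet whose first byte encodes version 4 and mode 3 (client), and the offset comes from the standard four-timestamp formula.

```python
import struct

NTP_EPOCH_OFFSET = 2208988800  # seconds from 1900-01-01 to 1970-01-01


def build_client_request() -> bytes:
    """Minimal client request: LI=0, VN=4, Mode=3 -> 0x23, rest zeroed."""
    return bytes([0x23]) + bytes(47)


def ntp_to_unix(seconds: int, fraction: int) -> float:
    """Convert a 64-bit NTP timestamp (32.32 fixed point) to Unix time."""
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def server_timestamps(reply: bytes):
    """Return (receive, transmit) times from bytes 32..47 of a reply."""
    rx_s, rx_f, tx_s, tx_f = struct.unpack("!4I", reply[32:48])
    return ntp_to_unix(rx_s, rx_f), ntp_to_unix(tx_s, tx_f)


def clock_offset(t1: float, t2: float, t3: float, t4: float) -> float:
    """t1 = client send, t2 = server receive, t3 = server send, t4 = client receive."""
    return ((t2 - t1) + (t3 - t4)) / 2
```

The request would be sent over UDP to port 123 on the server. Nothing here remembers earlier replies, which is why a checker of this kind cannot adapt to a KoD response between runs.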

If I turn KOD back on (which triggers the ntpd server to restart) the very first request from the check_ntp_time plugin is rejected, and all subsequent requests are as well, even if I leave it a long time between requests.

I would expect at least the first query to be allowed if the real cause is rate limiting, but every single request is rejected when KOD is enabled. (What is the actual value of the rate limit ? How many minutes per request per client ? I haven't seen it documented anywhere)

As a workaround one could use "Custom Access Restrictions" to create a custom configuration for the IP address of the LibreNMS/Nagios monitoring server where KOD is disabled to allow it to poll freely.

However in my case I have around 60 IP cameras that are each polling every 10 minutes (changing this is outside my control) and I want the ntp queries to always be answered, so I guess I just disable KOD and be happy that I found a solution, and hope that anyone else having the same issue finds my thread. (which is the real reason I posted about this as I already had a solution before posting)

serbus

@DBMandrake

Hello!

I don't think the problem is that you are calling check_ntp_time too frequently. My understanding is that the code inside check_ntp_time was making numerous, non-delayed queries to the ntp server, which is what triggered the KoD detection.

It looks like a --delay option flag was added to the code that you can use to prevent the KoD.
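The point about back-to-back queries can be illustrated with a deliberately simplified model. In ntpd the limits are set with the `discard` directive; the 2-second minimum spacing used below is believed to be the documented default, but treat it as an assumption and check the documentation for your ntpd version. The real daemon also tracks an average spacing per client, which this sketch ignores.

```python
def flag_rate_limited(arrivals, min_gap=2.0):
    """Return one boolean per request: True if it arrived less than
    `min_gap` seconds after the previous request from the same client.

    `arrivals` is a list of arrival times in seconds, in ascending order.
    The 2.0 default is an assumed value, not read from any real config.
    """
    flagged = []
    previous = None
    for t in arrivals:
        flagged.append(previous is not None and (t - previous) < min_gap)
        previous = t
    return flagged


# Four queries sent within a few milliseconds, as a bursty checker might do.
burst = [0.00, 0.01, 0.02, 0.03]
# One query every five minutes, as a polling schedule would suggest.
spaced = [0.0, 300.0, 600.0]
```

Under this model `flag_rate_limited(burst)` flags every query after the first, while `flag_rate_limited(spaced)` flags none, so the time between plugin runs matters far less than the spacing of the queries inside a single run.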

https://nagios-plugins.org/doc/man/check_ntp_time.html

John

stephenw10

Did disabling KoD also affect the ntp updates at the cameras?

DBMandrake

@serbus If each invocation of check_ntp_time is making numerous queries with no delay (why ??) as you say, then it would make sense that it causes a problem. It seems like pretty dumb default behaviour.

The delay option looks like it could be a workaround. The problem is that without trial and error I don't know what timing would and wouldn't be considered acceptable to the KoD algorithm, as I haven't seen this documented.

I think in my case I'll just keep KoD disabled as all clients are on an internal network.

Edit: just checked and found the version of check_ntp_time on the system doesn't support the delay option. (Ubuntu 22.04.4 LTS, installed via APT)

DBMandrake

@stephenw10

Hi - hard to be sure, as there's not really any diagnostics or logging available on the camera UIs to test NTP, other than just setting the time to be wrong then trying to force an update, which I'm not keen to do manually on over 60 cameras...

The proof of the pudding will be if they start drifting out of time again, but that will take a while to find out.

I think that in my specific use case, where it only serves specific internal clients, disabling KoD is the best option.