Pfsense 2.3/NTP/ESXi - NTP not stable

TugBoat

There appears to be a problem with NTP in pfsense 2.3 (and its minor step releases) when it is run on VMware ESXi with the VM guest tools installed. The problem is that NTP never stabilises the time; the fault appears even in very simple cases, such as when a single local NTP server is used.

My assumption is that this is actually related to FreeBSD/NTP/VMware rather than pfsense directly; however, it does make the 2.3 pfsense release less than useful in a virtualised environment. So far I have only tested extensively with pfsense; I have not tried installing FreeBSD itself to confirm the findings. My objective is to get a stable pfsense instance running, which at this stage means using pfsense 2.2.6.

My test is to configure a VM for pfsense and hook its 'WAN' to the local network and then configure the pfsense NTP to point at a local NTP server and wait for an hour and see what happens to the NTP stats on the test system. On a 'working' system the offset and jitter quickly drop to near to zero and the Poll interval increases. On a 'non working' system the offset and jitter increase and the Poll interval stays at 64.

The problem is much more evident if you configure the test system with 3 NTP servers from the pfsense NTP pool.

I have tested 2.3 with both the pfsense openvmtools package and the VMware standard tools (which require a manual installation of compat6 from the FreeBSD archives). It makes no difference which tools are in use.

I have also tried the various minor step releases to 2.3 and these do not make any difference to this problem.

My results are:

ESXi 5.5U3/pfsense 2.2.5/open-vm-tools : works OK
ESXi 6.0U2/pfsense 2.2.6/open-vm-tools : works OK
ESXi 6.0U2/pfsense 2.3/no vm tools : works OK
ESXi 6.0U2/pfsense 2.3/pfsense openvm package : does not work
ESXi 6.0U2/pfsense 2.3/vmware guest tools: does not work

Tim

johnpoz

"The problem is much more evident if you configure the test system with 3 NTP servers from the pfsense NTP pool."

I run my pfsense on esxi, and have no problems with time stability..

Waiting an hour is not really a very long time for ntp to become stable ;) Using the pool doesn't bode well for allowing the poll time to increase to a higher value, because you might only talk to the same pool member for a few queries.

Pool servers go offline all the time, and they change all the time. You do understand that resolving pool.ntp.org is a round robin that changes all the time, and has a very short TTL?

;; QUESTION SECTION:
;0.pool.ntp.org. IN A

;; ANSWER SECTION:
0.pool.ntp.org. 150 IN A 74.123.31.4
0.pool.ntp.org. 150 IN A 173.230.235.13
0.pool.ntp.org. 150 IN A 132.163.4.102
0.pool.ntp.org. 150 IN A 204.2.134.163
;; WHEN: Sat Jun 18 00:06:04 Central Daylight Time 2016

150 second TTL… So in 2.5 minutes those IPs will be different..

;; QUESTION SECTION:
;0.pool.ntp.org. IN A

;; ANSWER SECTION:
0.pool.ntp.org. 141 IN A 64.71.128.26
0.pool.ntp.org. 141 IN A 129.250.35.251
0.pool.ntp.org. 141 IN A 173.255.246.13
0.pool.ntp.org. 141 IN A 72.14.183.239

;; WHEN: Sat Jun 18 00:08:35 Central Daylight Time 2016

How exactly do you hope to get in sync with an ntp server if you keep changing ntp servers? The poll time does not go up unless it can; if you keep changing ntp servers, poll will stay short.
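To make the rotation concrete, here is a minimal Python sketch that parses the two answer sections above and compares them. The helper name `parse_answers` is made up for illustration, and the records are copied from the dig output in this post. It only looks at these two samples and says nothing about how the pool rotates in general.

```python
def parse_answers(text):
    """Parse dig-style answer lines ("name ttl class type address") into tuples."""
    records = []
    for line in text.strip().splitlines():
        fields = line.split()
        # Skip blanks, dig comments (";; ...") and anything that is not an A record.
        if len(fields) != 5 or fields[0].startswith(";") or fields[3] != "A":
            continue
        name, ttl, _cls, _rtype, address = fields
        records.append((name, int(ttl), address))
    return records


# Answer section of the first query (00:06:04).
FIRST = """
0.pool.ntp.org. 150 IN A 74.123.31.4
0.pool.ntp.org. 150 IN A 173.230.235.13
0.pool.ntp.org. 150 IN A 132.163.4.102
0.pool.ntp.org. 150 IN A 204.2.134.163
"""

# Answer section of the second query (00:08:35).
SECOND = """
0.pool.ntp.org. 141 IN A 64.71.128.26
0.pool.ntp.org. 141 IN A 129.250.35.251
0.pool.ntp.org. 141 IN A 173.255.246.13
0.pool.ntp.org. 141 IN A 72.14.183.239
"""

first = parse_answers(FIRST)
second = parse_answers(SECOND)

first_ips = {address for _, _, address in first}
second_ips = {address for _, _, address in second}
shared_ips = first_ips & second_ips

ttl_seconds = first[0][1]
ttl_minutes = ttl_seconds / 60

print(f"TTL of first answer: {ttl_seconds} s = {ttl_minutes} min")
print(f"Addresses present in both answers: {sorted(shared_ips)}")
```

In these two samples no address appears in both answers, which is the point being made: a client that re-resolves the pool name keeps landing on different servers.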

What is your local ntp server syncing time from? What kind of time is it keeping, and what is it keeping time with? Why do you think you need the vm tools to sync time? Does it show the time sync'd or not after your hour? Is your reach 377? What are your other values? Can you post them after you have been running for this hour?


ntpq> pe
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*pi3-ntp.local.l .PPS.            1 u  494  512  377    0.329   -0.541   0.045
+esxi.local.lan  142.66.101.13    2 u  374  512  377    0.832    3.143   0.875

I just updated to 2.3.1_5 this morning and pfsense has been up
18 Hours 27 Minutes 13 Seconds

So you see my poll is 512. How ntp determines when the poll interval goes up is a pretty complicated thing. The fact that your poll is still at 64 after 1 hour is not a very good measure of whether ntp is working. As I said, I just rebooted mine this morning, so in just the 18 hours it has been running you can see how the clock jitter has dropped. This sure shows me that ntp is doing its job and working on keeping the time pfsense thinks it is as close as possible to, and in sync with, the reference clock.
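For anyone who wants to check these columns without eyeballing them, the following is a rough Python sketch that parses the peers billboard shown above. The function name `parse_peers` and the dict keys are invented for this example. It assumes each data row has the ten whitespace-separated columns seen here and that the first character of a row is the tally code. The reach column is an octal bitmask of the last eight polls, so 377 means all eight were answered.

```python
def parse_peers(billboard):
    """Parse the data rows of an ntpq peers billboard into a list of dicts."""
    peers = []
    for line in billboard.splitlines():
        stripped = line.strip()
        # Skip blank lines, the column header and the "=====" separator.
        if not stripped or stripped.startswith(("remote", "=")):
            continue
        # First character is the tally code: "*" marks the selected peer,
        # "+" a candidate, and a space means no code.
        tally, rest = line[0], line[1:]
        fields = rest.split()
        if len(fields) != 10:
            continue  # not a row in the layout this sketch assumes
        remote, refid, st, _type, when, poll, reach, delay, offset, jitter = fields
        reach_bits = int(reach, 8)  # reach is printed in octal
        peers.append({
            "tally": tally,
            "remote": remote,
            "refid": refid,
            "stratum": int(st),
            "when": when,  # kept as text; ntpq may print "-" or a unit suffix
            "poll": int(poll),
            "all_replied": reach_bits == 0o377,
            "delay": float(delay),
            "offset": float(offset),
            "jitter": float(jitter),
        })
    return peers


BILLBOARD = """
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*pi3-ntp.local.l .PPS.            1 u  494  512  377    0.329   -0.541   0.045
+esxi.local.lan  142.66.101.13    2 u  374  512  377    0.832    3.143   0.875
"""

peers = parse_peers(BILLBOARD)
for peer in peers:
    print(peer["tally"], peer["remote"], "poll", peer["poll"],
          "offset", peer["offset"], "jitter", peer["jitter"],
          "reach full" if peer["all_replied"] else "reach incomplete")
```

Sampling this periodically and watching whether poll climbs above 64 while offset and jitter shrink is the same check described in the first post, just automated.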

what is the output of rl?


ntpq> rl
associd=0 status=0615 leap_none, sync_ntp, 1 event, clock_sync,
version="ntpd 4.2.8p8@1.3265-o Fri Jun 10 15:40:53 UTC 2016 (1)",
processor="amd64", system="FreeBSD/10.3-RELEASE-p3", leap=00, stratum=2,
precision=-18, rootdelay=0.421, rootdisp=28.041, refid=192.168.9.32,
reftime=db0f5c2c.027ff9c7  Sat, Jun 18 2016  0:35:40.009,
clock=db0f5f85.fee393ab  Sat, Jun 18 2016  0:49:57.995, peer=56395, tc=9,
mintc=3, offset=-0.549558, frequency=-55.485, sys_jitter=0.043010,
clk_jitter=1.018, clk_wander=0.029
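As a side note on reading this output: the reftime and clock values are NTP timestamps in hex, with 32 bits of seconds since 1900-01-01 UTC before the dot and 32 bits of fraction after it. The short Python sketch below decodes the two values above. The function name is illustrative, and it assumes NTP era 0 (dates before 2036).

```python
from datetime import datetime, timedelta, timezone

# NTP counts seconds from the start of 1900, not from the Unix epoch.
NTP_EPOCH = datetime(1900, 1, 1, tzinfo=timezone.utc)


def decode_ntp_timestamp(hex_stamp):
    """Convert "ssssssss.ffffffff" (hex) to a UTC datetime, assuming NTP era 0."""
    seconds_hex, fraction_hex = hex_stamp.split(".")
    seconds = int(seconds_hex, 16)
    fraction = int(fraction_hex, 16) / 2**32  # fraction of a second
    return NTP_EPOCH + timedelta(seconds=seconds,
                                 microseconds=round(fraction * 1_000_000))


reftime = decode_ntp_timestamp("db0f5c2c.027ff9c7")
clock = decode_ntp_timestamp("db0f5f85.fee393ab")

print("reftime (UTC):", reftime.isoformat())
print("clock   (UTC):", clock.isoformat())
print("seconds since last update:", (clock - reftime).total_seconds())
```

The decoded values are in UTC, five hours ahead of the wall-clock times ntpq printed next to them, which fits the Central Daylight Time shown in the dig output earlier.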

[Attachment: ntpgraphs.jpg]

TugBoat

Hi John,

Thanks for your comments, it is very useful to see your results with NTP as it gives me something to work with. I am starting to wonder whether what I am seeing is somehow related to the specific processor/hardware of one of my ESXi systems.

I will try to address some of your questions. I kept my initial post as short as I could so as not to go into too much detail. I do understand how NTP works, including the rotating pools etc.

I should say that I started looking at this issue because I noticed that my pfsense instance had NTP that didn't look like it had locked after about 20 days. When I investigated other VMs (mainly Linux) within the same ESXi host they all had stable locks with NTP.

Firstly, yes, I understand that waiting an hour is not that long. However, when testing I was trying to create a relatively systematic test procedure. With my testing reduced to a single locally connected NTP server, I remove the various complexities of using the NTP pool servers. The difference I see in testing is stark and clearly visible within 1 hour in this environment.

Secondly, you are correct that the vmware tools are not required for NTP to function. What is interesting in my testing, however, is that without the vmware tools the 2.3 pfsense works "fine". In fact, without the vmware tools the 2.3 pfsense works pretty much like the 2.2.5/2.2.6 does with the vmware tools.

What is strange is that the installation of vmware tools into 2.3 appears to cause problems for NTP. What I typically see is shown in the 2 day graph below. Just installing the vmware tools appears to drastically alter the operation of NTP.

What is interesting in the graph below is that in the last few hours it looks like NTP is going to stabilise. This is after weeks of runtime - with the oscillating pattern. It is also after I had posted my initial post. :(

Unfortunately the 2.2.5 RRD graphing can't easily supply the same graph, however, it shows the NTP offset stabilising within a couple of hours.

So I am still mystified by why:

Installing the vmware tools (either the 'official' or 'openvm' ones) appears to interact badly with NTP.
After 20 days or more, the system I was initially investigating has suddenly started to work.

Unfortunately, we all have limited time, and with things like NTP it is very hard to get a firm grasp on what is happening due to the complexity of the entire system involved.

In the end if my router mysteriously gets into NTP sync after 20 days I suppose I am happy. At least I don't need to downgrade it to 2.2.6. Although I would really like to know exactly what is happening.

Thanks again for your time,
Tim

[Attachment: ntp.gif]

johnpoz

That is an odd looking graph.. But as you can see, your freq was moving down the whole time.. Your disp or jitter between samples seems to have been all over the board?? Which would explain your offsets fluctuating a lot..

I can tell you I have always had either the native tools way back when.. Since they went to 2.2, the native tools have been a bad idea..

So how long did it take to go stable? And did you keep rebooting it and changing the version of pfsense? It's been a couple of days since I rebooted, and my offsets are less than 3 milliseconds, my cljit is in the dirt, etc. But as I look, my poll is back down to 256 vs 512..

What's your freq at now? You can view it with rv

ntpq> rv
associd=0 status=0615 leap_none, sync_ntp, 1 event, clock_sync,
version="ntpd 4.2.8p8@1.3265-o Fri Jun 10 15:40:53 UTC 2016 (1)",
processor="amd64", system="FreeBSD/10.3-RELEASE-p3", leap=00, stratum=2,
precision=-18, rootdelay=0.426, rootdisp=16.529, refid=192.168.9.32,
reftime=db110ace.089bf9c9 Sun, Jun 19 2016 7:13:02.033,
clock=db110c02.45cd7d1a Sun, Jun 19 2016 7:18:10.272, peer=56395, tc=8,
mintc=3, offset=-2.049850, frequency=-55.031, sys_jitter=0.132032,
clk_jitter=0.419, clk_wander=0.109
ntpq>
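If you would rather pull the frequency out with a script than read it off the screen, something like the Python sketch below would do. The helper name `parse_rv` is made up for this example, and the regular expression is only an assumption about the layout seen in this thread: name=value pairs where a value is either quoted or runs up to the next comma or space. The text is copied from the rv output above.

```python
import re


def parse_rv(text):
    """Collect name=value pairs from rv-style output into a dict of strings."""
    # A value is either a quoted string or a run of characters up to the next
    # comma or whitespace, so the date text after reftime/clock is ignored.
    pairs = re.findall(r'(\w+)=("[^"]*"|[^,\s]+)', text)
    return {name: value.strip('"') for name, value in pairs}


RV_OUTPUT = """associd=0 status=0615 leap_none, sync_ntp, 1 event, clock_sync,
version="ntpd 4.2.8p8@1.3265-o Fri Jun 10 15:40:53 UTC 2016 (1)",
processor="amd64", system="FreeBSD/10.3-RELEASE-p3", leap=00, stratum=2,
precision=-18, rootdelay=0.426, rootdisp=16.529, refid=192.168.9.32,
reftime=db110ace.089bf9c9 Sun, Jun 19 2016 7:13:02.033,
clock=db110c02.45cd7d1a Sun, Jun 19 2016 7:18:10.272, peer=56395, tc=8,
mintc=3, offset=-2.049850, frequency=-55.031, sys_jitter=0.132032,
clk_jitter=0.419, clk_wander=0.109
"""

variables = parse_rv(RV_OUTPUT)

frequency_ppm = float(variables["frequency"])  # frequency correction in PPM
poll_seconds = 2 ** int(variables["tc"])       # tc is a log2 value in seconds

print("frequency (PPM):", frequency_ppm)
print("offset (ms):", float(variables["offset"]))
print("2**tc (s):", poll_seconds)
```

Here 2**tc comes out at 256, which lines up with the poll interval having dropped back from 512 to 256.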

TugBoat

Yes, the freq was indeed moving down over the period. What I don't know is why this took 3 weeks to occur. Unfortunately, I only turned the RRD data on for NTP a few days ago. For the preceding period I only have the NTP status information in my notes. The wildly oscillating offset and jitter were what initially attracted my attention.

So perhaps the question is how did the frequency get so far out? Even so why did it take 20 days to get into 'sync'? A bit difficult to know after the event.

The production 2.3 pfsense (where the graph came from) has been unaltered, other than turning on the RRD data. All my testing was done with other VMs running various versions of pfsense with and without the tools. I suppose the obvious hypothesis is that turning on the NTP RRD data, which restarts NTP, somehow fixed the problem (changed the NTP server set?). Whatever had happened prior to the restart had resulted in the clock being adjusted way out of whack and this took a long time to sort out.

Currently on the production system the stats look good:

ntpq> rv
associd=0 status=0615 leap_none, sync_ntp, 1 event, clock_sync,
version="ntpd 4.2.8p7@1.3265-o Mon May 16 19:34:33 UTC 2016 (1)",
processor="amd64", system="FreeBSD/10.3-RELEASE-p3", leap=00, stratum=3,
precision=-23, rootdelay=40.705, rootdisp=47.755, refid=103.38.120.36,
reftime=db11a980.e8657e44  Mon, Jun 20 2016  9:30:08.907,
clock=db11aaa2.6c60659c  Mon, Jun 20 2016  9:34:58.423, peer=48606, tc=9,
mintc=3, offset=0.082601, frequency=-16.581, sys_jitter=0.203869,
clk_jitter=0.439, clk_wander=0.038
ntpq>

I suppose I could go back and repeat all the testing. I still have my notes and I could start again and try and repeat the whole set of procedures, but that is probably flogging a dead horse at this stage. It would certainly take another day, which I can't really afford at this stage.

It took me a whole day of testing to come to my initial "conclusion" that something altered when I installed the vmtools onto the 2.3 pfsense. However, my production system is now proof that that "conclusion" was wrong.

I am certain that I have missed some critical piece of information in my testing, but I have no idea what it is.

Tim

johnpoz

So you're showing neg freq now

frequency=-16.581

Is that graphing for you, or does it show zero?

TugBoat

Interesting question…

The graphing shows zero (assuming we are talking about what is shown as 'freq' on the graphs). It has been zero since whatever happened when everything "started working".

The current report from NTP is:

ntpq> rv
associd=0 status=0615 leap_none, sync_ntp, 1 event, clock_sync,
version="ntpd 4.2.8p7@1.3265-o Mon May 16 19:34:33 UTC 2016 (1)",
processor="amd64", system="FreeBSD/10.3-RELEASE-p3", leap=00, stratum=3,
precision=-23, rootdelay=42.498, rootdisp=61.164, refid=103.38.120.36,
reftime=db1463a5.e8e703d7  Wed, Jun 22 2016 11:08:53.909,
clock=db1465a9.fbea4de9  Wed, Jun 22 2016 11:17:29.984, peer=48606, tc=9,
mintc=3, offset=-0.156099, frequency=-16.618, sys_jitter=0.865677,
clk_jitter=0.570, clk_wander=0.038
ntpq>

Tim

johnpoz

I submitted a bug report about the freq showing as zero on the graph when the numbers are negative..

TugBoat

A quick followup on this issue:

This issue is not related to the virtualisation, it is related to negative drift coefficients. My hardware requires a drift coefficient of approx -15.

The drift file /var/db/ntp.drift is either being removed (due to a negative coefficient?) or is not saved across a system restart. I am not sure which is the case. At this stage I have not had the time to investigate. All I know is that when I log in after a pfsense restart there is no ntp.drift file.

In this situation NTP starts for some reason with +500 as the drift. Given that the hardware requires -15 it takes a very long time for the NTP daemon to sort things out. In my case I can fix the problem instantly by:

1. Stop the NTP service from the pfsense web admin.
2. Create the /var/db/ntp.drift file and put in -15.000 as the value.
3. Restart NTP from the pfsense web admin.

If I do this the entire NTP system stabilises in no time (5 minutes) and everything is OK from then on.
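To put the numbers above in perspective, here is a small back-of-envelope Python sketch. The drift value is in parts per million (PPM), so the gap between a +500 start and the -15 this hardware needs is 515 PPM. The function names are invented for this example, and the file is written to a temporary directory rather than /var/db/ntp.drift so the sketch is safe to run anywhere. It illustrates step 2 and is not a replacement for stopping and restarting the service.

```python
import os
import tempfile

SECONDS_PER_DAY = 86_400


def seconds_per_day(ppm):
    """Clock error accumulated per day for a frequency error given in PPM."""
    return ppm * 1e-6 * SECONDS_PER_DAY


def write_drift_file(path, ppm):
    """Write a single drift value with three decimals, as in step 2 above."""
    with open(path, "w") as handle:
        handle.write(f"{ppm:.3f}\n")


startup_ppm = 500.0    # value NTP was seen starting with when the file is missing
hardware_ppm = -15.0   # approximate value this hardware needs
error_ppm = startup_ppm - hardware_ppm

print(f"Initial frequency error: {error_ppm} PPM")
print(f"Equivalent clock error: {seconds_per_day(error_ppm):.3f} s/day")

# Demonstrate the file format on a throwaway path (hypothetical location).
demo_path = os.path.join(tempfile.mkdtemp(), "ntp.drift")
write_drift_file(demo_path, hardware_ppm)
with open(demo_path) as handle:
    written = handle.read()
print("Drift file contents:", written.strip())
```

A 515 PPM error is roughly 44.5 seconds per day if left uncorrected, which gives a feel for how far off the starting point is and why seeding the file with -15.000 gets things stable so much faster.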

Tim