Ntpd silently exiting if time is substantially off



  • I'm running pfSense 2.1-BETA0 with ntpd 4.2.6p5 in a VirtualBox VM. If I "suspend" the host system overnight, the clock of the FreeBSD/pfSense VM is off by several hours when the host system wakes up again the next morning.

    In this configuration I've often noticed that pfSense's ntpd process has exited silently (no message in /var/log/ntpd.log), leaving the time wrong. I believe this is due to ntpd's behavior that "once the clock has been set, an error greater than 1000 s will cause ntpd to exit anyway." (src)

    Is it possible to configure ntpd so that it does not exit, but syncs the time regardless, when the system time is substantially off?



  • This is an interesting read on the subject:

    http://www.vmware.com/files/pdf/Timekeeping-In-VirtualMachines.pdf

    It seems the way to solve this is to add "tinker panic 0" at the top of ntpd.conf

    Using NTP in Linux and Other Guests

    The Network Time Protocol is usable in a virtual machine with proper configuration of the NTP daemon.

    The following points are important:
    • Do not configure the virtual machine to synchronize to its own (virtual) hardware clock, not even as a fallback with a high stratum number. Some sample ntp.conf files contain a section specifying the local clock as a potential time server, often marked with the comment “undisciplined local clock.” Delete any such server specification from your ntp.conf file.
    • Include the option tinker panic 0 at the top of your ntp.conf file. By default, the NTP daemon sometimes panics and exits if the underlying clock appears to be behaving erratically. This option causes the daemon to keep running instead of panicking.
    • Follow standard best practices for NTP: Choose a set of servers to synchronize to that have accurate time and adequate redundancy.

    If you have many virtual or physical client machines to synchronize, set up some internal servers for them to use, so that all your clients are not directly accessing an external low-stratum NTP server and overloading it with requests.

    The following sample ntp.conf file is suitable if you have few enough clients that it makes sense for them to access an external NTP server directly. If you have many clients, adapt this file by changing the server names to reference your internal NTP servers.

    NOTE: Any tinker commands used must appear first.

    ntp.conf

    tinker panic 0
    restrict 127.0.0.1
    restrict default kod nomodify notrap
    server 0.vmware.pool.ntp.org
    server 1.vmware.pool.ntp.org
    server 2.vmware.pool.ntp.org
    server 3.vmware.pool.ntp.org

    Here is a sample /etc/ntp/step-tickers corresponding to the sample ntp.conf file above.

    step-tickers

    0.vmware.pool.ntp.org
    1.vmware.pool.ntp.org

    Make sure that ntpd is configured to start at boot time. On some distributions this can be accomplished with the command chkconfig ntpd on, but consult your distribution’s documentation for details. On most distributions, you can start ntpd manually with the command /etc/init.d/ntpd start.

    PS: I've changed my local copy of the code that creates ntpd.conf (editing /etc/inc/system.inc) and I'll let you know if it fixes it.



  • @dhatz:

    PS: I've changed my local copy of the code that creates ntpd.conf (editing /etc/inc/system.inc) and I'll let you know if it fixes it.

    When using "tinker panic 0" ntpd won't exit if time is substantially off, but it doesn't seem to re-sync the time either … I'll have a look at ntpd's config to see if there's anything about it.

    Btw, is there some "watchdog" cron job that monitors running services (racoon, openvpn, ntpd etc.) and restarts them if needed?



  • Apparently the "tinker panic 0" change does fix the issue after all:

    1. ntpd doesn't exit anymore if time diff > 1000s, and
    2. eventually it re-syncs time (didn't monitor it closely enough to see how fast it does it)
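
The behaviour discussed so far can be pictured with a small sketch. This is a simplified model, not ntpd's actual code: it assumes the 1000 s panic threshold quoted in the first post and ntpd's default 128 ms step threshold, treats "tinker panic 0" as switching the panic check off, and ignores the -g option and the delay ntpd applies before stepping. The function name `ntpd_action` is made up for illustration.

```python
# Simplified model of how ntpd reacts to a measured clock offset.
# Assumptions (not ntpd source code): a 1000 s panic threshold and a 128 ms
# step threshold; the -g option and the delay before a step are ignored.

STEP_THRESHOLD_S = 0.128
PANIC_THRESHOLD_S = 1000.0


def ntpd_action(offset_s, panic_threshold_s=PANIC_THRESHOLD_S,
                step_threshold_s=STEP_THRESHOLD_S):
    """Return 'exit', 'step' or 'slew' for a given offset in seconds."""
    magnitude = abs(offset_s)
    # "tinker panic 0" is modelled as a threshold of 0, which disables the check.
    if panic_threshold_s > 0 and magnitude > panic_threshold_s:
        return "exit"
    if magnitude > step_threshold_s:
        return "step"
    return "slew"


if __name__ == "__main__":
    several_hours = 3 * 3600.0  # e.g. a VM resumed after an overnight suspend
    print(ntpd_action(several_hours))                       # default settings
    print(ntpd_action(several_hours, panic_threshold_s=0))  # with tinker panic 0
    print(ntpd_action(0.05))                                # small drift
```

Under this model an offset of several hours makes ntpd exit with default settings, while with the panic check disabled the same offset is eventually corrected by a step.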

  • Rebel Alliance Developer Netgate

    I committed the fix, should be in the next snapshot.



  • @jimp:

    I committed the fix, should be in the next snapshot.

    My opinion is that NTP was not intended for use on a virtual machine and this setting should be an option as a workaround.
    I know NTP is still under development, but it is not very secure without "restrict" lines.


  • LAYER 8 Global Moderator

    "My opinion is that NTP was not intended for use on a virtual machine"

    Funny how this is not VMware's opinion ;)

    http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1006427
    Timekeeping best practices for Linux guests

    Note: VMware recommends you to use NTP instead of VMware Tools periodic time synchronization. NTP is an industry standard and ensures accurate time keeping in your guest. You may have to open the firewall (UDP 123) to allow NTP traffic.

    http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1318
    Timekeeping best practices for Windows, including NTP

    Windows Version    Recommended Time Sync Utility
    Windows 2008       w32time or NTP
    Windows Vista      w32time or NTP
    Windows 2003       w32time or NTP
    Windows XP         NTP
    Windows 2000       NTP

    As to security concerns - the configuration tab allows you to restrict which interfaces it will listen on. I don't see much of a concern in letting your time server serve time to your local LAN ;)

    I am sure that as this addition matures, more detailed configuration such as specific restricts will be coming - worst case, you can always modify the ntpd.conf in /var/etc if you're really paranoid.



  • @johnpoz:

    Funny how this is not VMware's opinion ;)

    That is understandable from their point of view.
    I'm not saying I have made a study of the NTP protocol, but I read the newsgroup and am a Pool member (so I want to serve time on WAN). As far as I understand the NTP protocol, it is made for running 24/7, even under VMware ;)
    It is very good that pfSense switched to the latest version of the NTP protocol because it is actively developed.

    I switched to pfSense a week ago and am now studying the behavior of NTP on my box. It certainly behaves well!
    I now know how to change the settings in /etc/inc/system.inc (and not in /var/etc/ntpd.conf, which is overwritten when ntpd restarts), but these changes are not permanent either. That is how I have done it for now.



  • My personal experience of time sync on VMware and physical systems over several years has yielded the following:

    Set the ESXis to sync via ntp to five sources
    Windows DCs - use vm guest tools to sync to the host they are on
    Windows non DCs - leave at defaults, ie sync to the PDC emulator
    Unix style systems (*BSD, Linux et al) - sync via ntpd to the hosts

    I watch timesync on around 500-odd systems around the country (UK) via Nagios and they all agree on time to within one or two milliseconds depending on OS (Unix is best, Windows worst, if you count a millisecond drift on a VM as "bad").

    I have not had to restart either ntpd or "windows time" in a very long … time using these rules.

    With a manually configured ntpd I use tinker panic 0 to avoid a 30 second drift being considered "insane".  I also use iburst on the server lines to get a much quicker initial sync, and [ssh] I see PF does as well.

    I used to use three pool systems but found that after a few weeks/months time would start to drift.  Since using five ({0,1,2,3}.pool.ntp.org and 0.uk.pool.ntp.org) I have not seen that behaviour on any system I manage in at least the last four years.

    Cheers
    Jon
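
The setup described in this post (tinker panic 0, five pool servers, iburst on each server line) could be written out roughly as follows. This is only a sketch: `render_ntp_conf` is a hypothetical helper, not part of pfSense or ntpd, and a real ntp.conf would normally also carry restrict lines and other settings. The tinker line is emitted first because, as noted earlier in the thread, tinker commands must appear first.

```python
# Hypothetical helper that renders a minimal ntp.conf from a list of servers.
# Only "tinker panic", "server" and "iburst" are used, all taken from this thread.


def render_ntp_conf(servers, panic_threshold=0):
    """Return ntp.conf text with the tinker line first, then one server per line."""
    lines = [f"tinker panic {panic_threshold}"]
    lines += [f"server {name} iburst" for name in servers]
    return "\n".join(lines) + "\n"


# The five sources mentioned above: {0,1,2,3}.pool.ntp.org and 0.uk.pool.ntp.org
SERVERS = [f"{i}.pool.ntp.org" for i in range(4)] + ["0.uk.pool.ntp.org"]

if __name__ == "__main__":
    print(render_ntp_conf(SERVERS), end="")
```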



  • I haven't yet found the time to monitor ntpd closely during a VM suspend cycle, but empirically I can say it can take quite a long time for ntpd to sync. For example, it's been over 1 hr since I restarted the suspended VM, yet ntpd still hasn't corrected the system time:

    ntpq -p
         remote           refid      st t when poll reach   delay   offset  jitter
     cache.asda.gr   131.188.3.221    2 u  392  512  377   16.808  3499206 1870404
     stitch.fr.zerol 192.93.2.20      2 u  399  512  377   72.440  3499206 1870404
     noc.be.it2go.eu 193.190.230.65   2 u  160  512  377   89.826  3499206 1322575
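
To put the numbers above in perspective: ntpq -p reports delay, offset and jitter in milliseconds, so an offset of 3499206 is a bit over 58 minutes, well beyond the 1000 s panic threshold mentioned in the first post. The sketch below pulls the offset column out of the rows shown above. `peer_offsets_s` is a hypothetical helper, and it assumes rows without a leading tally character (such as `*` or `+`), as in this paste.

```python
# Hypothetical helper: extract the offset column from "ntpq -p" peer rows.
# ntpq -p reports delay, offset and jitter in milliseconds.

SAMPLE_ROWS = """\
cache.asda.gr   131.188.3.221    2 u  392  512  377   16.808  3499206 1870404
stitch.fr.zerol 192.93.2.20      2 u  399  512  377   72.440  3499206 1870404
noc.be.it2go.eu 193.190.230.65   2 u  160  512  377   89.826  3499206 1322575
"""

PANIC_THRESHOLD_S = 1000.0  # default panic threshold quoted in the first post


def peer_offsets_s(ntpq_rows):
    """Map each remote name to its offset, converted from ms to seconds."""
    offsets = {}
    for line in ntpq_rows.splitlines():
        fields = line.split()
        if len(fields) != 10:
            continue  # not a peer row
        try:
            offset_ms = float(fields[8])
        except ValueError:
            continue  # header row: the offset column is not a number
        offsets[fields[0]] = offset_ms / 1000.0
    return offsets


if __name__ == "__main__":
    for remote, offset_s in peer_offsets_s(SAMPLE_ROWS).items():
        over = offset_s > PANIC_THRESHOLD_S
        print(f"{remote}: {offset_s / 60:.1f} min off, beyond panic threshold: {over}")
```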


  • Rebel Alliance Developer Netgate

    We usually rely on ntpdate to make the big changes, and then let ntpd handle keeping the clock in line over time. If you restart ntpd (or save the settings, iirc) it will stop ntpd, run ntpdate, then restart ntpd.

    It can take a long time for ntpd to recover from a large skew, since it will only step the clock by 0.128 second increments. This can be adjusted with the step parameter to the tinker config option, but I recall setting that larger had negative effects.

    Though -x to ntpd might help…

    -x      Normally, the time is slewed if the offset is less than the step
            threshold, which is 128 ms by default, and stepped if above the
            threshold.  This option sets the threshold to 600 s, which is
            well within the accuracy window to set the clock manually.  Note:
            Since the slew rate of typical Unix kernels is limited to 0.5
            ms/s, each second of adjustment requires an amortization interval
            of 2000 s.  Thus, an adjustment as much as 600 s will take almost
            14 days to complete.  This option can be used with the -g and -q
            options.  See the tinker command for other options.  Note: The
            kernel time discipline is disabled with this option.
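
The arithmetic in the -x description can be checked with a few lines. This is only a sketch of the calculation, assuming the 0.5 ms/s slew rate quoted above; `slew_duration_s` is a name made up for illustration.

```python
# Rough estimate of how long slewing takes at the rate quoted in the man page.

SLEW_RATE_S_PER_S = 0.0005  # 0.5 ms of correction per second of real time
SECONDS_PER_DAY = 86400


def slew_duration_s(offset_s, slew_rate=SLEW_RATE_S_PER_S):
    """Seconds of wall-clock time needed to slew away offset_s seconds."""
    return abs(offset_s) / slew_rate


if __name__ == "__main__":
    for offset in (1.0, 600.0, 3499.206):
        days = slew_duration_s(offset) / SECONDS_PER_DAY
        print(f"{offset} s offset: {slew_duration_s(offset):.0f} s, about {days:.1f} days")
```

One second of offset works out to 2000 s of slewing and 600 s to just under 14 days, matching the man page. By the same arithmetic, an offset of roughly 3500 s like the one in the ntpq output above would need about 81 days if it were slewed rather than stepped.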



  • Btw ntpd finally synced the time, apparently in one big step, but it took ~1.5 hr after the VM was resumed from yesterday's suspend. No message whatsoever in /var/log/ntpd.log.



  • As a test, I'm currently running ntpd with the following two lines added:

    server  127.127.1.0    # local clock
    fudge  127.127.1.0 stratum 10

    The server line says that the local system clock is a timeserver. The fudge line says that this server is stratum 10. If you are connected to the Internet, you are likely using timeservers with a better (lower) stratum than 10, and those servers are used because a lower stratum means a higher priority.

    However, if you are disconnected from the Internet then they are unavailable and you're left with the local clock. Using fudge to say that the local clock is stratum 10 makes ntp use the local clock when no timeservers are available. This is good because it makes sure you can disconnect your box from the Internet without getting your clock screwed.



  • @dhatz:

    As a test, I'm currently running ntpd with the following two lines added:

    If you are trying to run ntpd isolated you should use "orphan mode" in this version of ntpd.

    http://www.eecis.udel.edu/~mills/ntp/html/orphan.html

