    Quality RRD data collection on WAN apparently stopped nearly a year ago

    General pfSense Questions

      wallabybob

      I'm currently running

      2.1-RC0 (i386)
      built on Thu May 23 19:52:31 EDT 2013
      FreeBSD 8.3-RELEASE-p8

      after many firmware updates.

      Today I noticed that my Status -> RRD Graphs -> Quality graphs have stopped collecting data. The WAN graphs are completely empty. The WAN_DHCP graph has data up to week 12 on the 3 months, 1 day average graph and nothing thereafter. The GW_WAN graph has data up to about mid-June on the 1 year, 1 day average graph and nothing thereafter.

      Here are the files that seem to be relevant:

      [2.1-RC0][root@pfsense2.test.example.org]/root(32): ls -l /var/db/rrd/
      total 9000
      -rw-r--r--  1 nobody  wheel  47608 Jun 11  2012 GW_WAN-quality.rrd
      -rw-r--r--  1 nobody  wheel  47608 May 19  2012 OPT3VIRGINBROADBANDMOBILE_-quality.rrd
      -rw-r--r--  1 nobody  wheel  47608 Feb 29  2012 WAN-quality.rrd
      -rw-r--r--  1 nobody  wheel  47608 Mar 28 06:03 WAN_DHCP-quality.rrd
      -rw-r--r--  1 nobody  wheel  195424 Feb 29  2012 captiveportal-cpZone-concurrent.rrd
      -rw-r--r--  1 nobody  wheel  195424 Feb 29  2012 captiveportal-cpZone-loggedin.rrd
      -rw-r--r--  1 nobody  wheel  195424 May 25 18:20 captiveportal-cpzone-concurrent.rrd
      -rw-r--r--  1 nobody  wheel  195424 May 27 10:30 captiveportal-cpzone-loggedin.rrd
      -rw-r--r--  1 nobody  wheel  388984 May 27 10:30 ipsec-packets.rrd
      -rw-r--r--  1 nobody  wheel  388984 May 27 10:30 ipsec-traffic.rrd
      -rw-r--r--  1 nobody  wheel  388984 May 27 10:30 lan-packets.rrd
      -rw-r--r--  1 nobody  wheel  388984 May 27 10:30 lan-traffic.rrd
      -rw-r--r--  1 nobody  wheel  388984 Oct 24  2012 opt1-packets.rrd
      -rw-r--r--  1 nobody  wheel  388984 Oct 24  2012 opt1-traffic.rrd
      -rw-r--r--  1 nobody  wheel  146224 Oct 23  2012 opt1-wireless.rrd
      -rw-r--r--  1 nobody  wheel  388984 Oct 24  2012 opt2-packets.rrd
      -rw-r--r--  1 nobody  wheel  388984 Oct 24  2012 opt2-traffic.rrd
      -rw-r--r--  1 nobody  wheel  146224 May 25 18:20 opt3-cellular.rrd
      -rw-r--r--  1 nobody  wheel  388984 May 27 10:30 opt3-packets.rrd
      -rw-r--r--  1 nobody  wheel  388984 May 27 10:30 opt3-traffic.rrd
      -rw-r--r--  1 nobody  wheel  388984 May 27 10:30 opt4-packets.rrd
      -rw-r--r--  1 nobody  wheel  388984 May 27 10:30 opt4-traffic.rrd
      -rw-r--r--  1 nobody  wheel  388984 May 25 13:10 opt6-packets.rrd
      -rw-r--r--  1 nobody  wheel  388984 May 25 13:10 opt6-traffic.rrd
      -rw-r--r--  1 nobody  wheel  97672 May 30  2012 ppp-cellular.rrd
      -rw-r--r--  1 nobody  wheel  727424 May 27 10:30 system-memory.rrd
      -rw-r--r--  1 nobody  wheel  243328 May 27 10:30 system-processor.rrd
      -rw-r--r--  1 nobody  wheel  243328 May 27 10:30 system-states.rrd
      -rw-r--r--  1 root    wheel    7011 May 25 18:19 updaterrd.sh
      -rw-r--r--  1 nobody  wheel  388984 May 27 10:30 wan-packets.rrd
      -rw-r--r--  1 nobody  wheel  388984 May 27 10:30 wan-traffic.rrd
      [2.1-RC0][root@pfsense2.test.example.org]/root(33): cat /var/etc/apinger.conf
      # pfSense apinger configuration file. Automatically Generated!

      # User and group the pinger should run as
      user "root"
      group "wheel"

      # Mailer to use (default: "/usr/lib/sendmail -t")
      #mailer "/var/qmail/bin/qmail-inject"

      # Location of the pid-file (default: "/var/run/apinger.pid")
      pid_file "/var/run/apinger.pid"

      # Format of timestamp (%s macro) (default: "%b %d %H:%M:%S")
      #timestamp_format "%Y%m%d%H%M%S"

      status {
          # File where the status information should be written to
          file "/var/run/apinger.status"
          # Interval between file updates
          # when 0 or not set, file is written only when SIGUSR1 is received
          interval 5s
      }

      ########################################
      # RRDTool status gathering configuration
      # Interval between RRD updates
      rrd interval 60s;

      # These parameters can be overridden in a specific alarm configuration
      alarm default {
          command on "/usr/local/sbin/pfSctl -c 'service reload dyndns %T' -c 'service reload ipsecdns' -c 'service reload openvpn %T' -c 'filter reload' "
          command off "/usr/local/sbin/pfSctl -c 'service reload dyndns %T' -c 'service reload ipsecdns' -c 'service reload openvpn %T' -c 'filter reload' "
          combine 10s
      }

      # "Down" alarm definition.
      # This alarm will be fired when target doesn't respond for 30 seconds.
      alarm down "down" {
          time 10s
      }

      # "Delay" alarm definition.
      # This alarm will be fired when responses are delayed more than 200ms
      # it will be canceled, when the delay drops below 100ms
      alarm delay "delay" {
          delay_low 200ms
          delay_high 500ms
      }

      # "Loss" alarm definition.
      # This alarm will be fired when packet loss goes over 20%
      # it will be canceled, when the loss drops below 10%
      alarm loss "loss" {
          percent_low 10
          percent_high 20
      }

      target default {
          # How often the probe should be sent
          interval 1s

          # How many replies should be used to compute average delay
          # for controlling "delay" alarms
          avg_delay_samples 10

          # How many probes should be used to compute average loss
          avg_loss_samples 50

          # The delay (in samples) after which loss is computed
          # without this delays larger than interval would be treated as loss
          avg_loss_delay_samples 20

          # Names of the alarms that may be generated for the target
          alarms "down","delay","loss"

          # Location of the RRD
          #rrd file "/var/db/rrd/apinger-%t.rrd"
      }
      [2.1-RC0][root@pfsense2.test.example.org]/root(35):

      I compared apinger.conf with that on a pfSense 2.0.1 system that had up-to-date WAN quality data and noted that, apart from target default, there was no target section in apinger.conf on pfSense 2.1. On a whim I went to System -> Routing and, on the Gateways tab, clicked on the e button beside the WAN gateway, then, without changing anything, clicked on Save and Apply. After that the apinger.conf file had a target section for the gateway:

      [2.1-RC0][root@pfsense2.test.example.org]/root(35): cat /var/etc/apinger.conf
      # pfSense apinger configuration file. Automatically Generated!

      # User and group the pinger should run as
      user "root"
      group "wheel"

      # Mailer to use (default: "/usr/lib/sendmail -t")
      #mailer "/var/qmail/bin/qmail-inject"

      # Location of the pid-file (default: "/var/run/apinger.pid")
      pid_file "/var/run/apinger.pid"

      # Format of timestamp (%s macro) (default: "%b %d %H:%M:%S")
      #timestamp_format "%Y%m%d%H%M%S"

      status {
          # File where the status information should be written to
          file "/var/run/apinger.status"
          # Interval between file updates
          # when 0 or not set, file is written only when SIGUSR1 is received
          interval 5s
      }

      ########################################
      # RRDTool status gathering configuration
      # Interval between RRD updates
      rrd interval 60s;

      # These parameters can be overridden in a specific alarm configuration
      alarm default {
          command on "/usr/local/sbin/pfSctl -c 'service reload dyndns %T' -c 'service reload ipsecdns' -c 'service reload openvpn %T' -c 'filter reload' "
          command off "/usr/local/sbin/pfSctl -c 'service reload dyndns %T' -c 'service reload ipsecdns' -c 'service reload openvpn %T' -c 'filter reload' "
          combine 10s
      }

      # "Down" alarm definition.
      # This alarm will be fired when target doesn't respond for 30 seconds.
      alarm down "down" {
          time 10s
      }

      # "Delay" alarm definition.
      # This alarm will be fired when responses are delayed more than 200ms
      # it will be canceled, when the delay drops below 100ms
      alarm delay "delay" {
          delay_low 200ms
          delay_high 500ms
      }

      # "Loss" alarm definition.
      # This alarm will be fired when packet loss goes over 20%
      # it will be canceled, when the loss drops below 10%
      alarm loss "loss" {
          percent_low 10
          percent_high 20
      }

      target default {
          # How often the probe should be sent
          interval 1s

          # How many replies should be used to compute average delay
          # for controlling "delay" alarms
          avg_delay_samples 10

          # How many probes should be used to compute average loss
          avg_loss_samples 50

          # The delay (in samples) after which loss is computed
          # without this delays larger than interval would be treated as loss
          avg_loss_delay_samples 20

          # Names of the alarms that may be generated for the target
          alarms "down","delay","loss"

          # Location of the RRD
          #rrd file "/var/db/rrd/apinger-%t.rrd"
      }
      target "192.168.211.173" {
          description "GW_WAN"
          srcip "192.168.211.217"
          alarms override "loss","delay","down";
          rrd file "/var/db/rrd/GW_WAN-quality.rrd"
      }

      [2.1-RC0][root@pfsense2.test.example.org]/root(36):

      and a little while later noted that the GW_WAN-quality.rrd file had been updated:

      [2.1-RC0][root@pfsense2.test.example.org]/root(36): ls -l /var/db/rrd/GW_WAN-quality.rrd
      -rw-r--r--  1 nobody  wheel  47608 May 27 11:51 /var/db/rrd/GW_WAN-quality.rrd
      [2.1-RC0][root@pfsense2.test.example.org]/root(37):

      and now the Status -> RRD Graphs -> Quality graphs for GW_WAN show data for a few minutes back then nothing back to June 2012.
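
The check described above (does apinger.conf contain a per-gateway target block, and which RRD file does it write to?) can be sketched in Python. This is only an illustration: the function name and regular expressions are made up for this example and are not part of pfSense or apinger, and the sample text is trimmed from the listing above.

```python
import re

# Matches blocks such as:  target "192.168.211.173" { ... }
# The unquoted `target default { ... }` block is deliberately not matched.
TARGET_RE = re.compile(r'^\s*target\s+"([^"]+)"\s*\{(.*?)^\s*\}', re.S | re.M)
# Matches an active (uncommented) `rrd file "..."` line inside a block.
RRD_RE = re.compile(r'^\s*rrd\s+file\s+"([^"]+)"', re.M)


def monitored_targets(conf_text):
    """Return {target address: rrd file path or None} for each quoted target block."""
    targets = {}
    for address, body in TARGET_RE.findall(conf_text):
        rrd = RRD_RE.search(body)
        targets[address] = rrd.group(1) if rrd else None
    return targets


# Sample trimmed from the second apinger.conf listing in this post.
sample = '''
target default {
#rrd file "/var/db/rrd/apinger-%t.rrd"
}
target "192.168.211.173" {
description "GW_WAN"
srcip "192.168.211.217"
alarms override "loss","delay","down";
rrd file "/var/db/rrd/GW_WAN-quality.rrd"
}
'''

print(monitored_targets(sample))
```

An empty result for a file that should be monitoring a gateway corresponds to the "no target section" state seen before the gateway was re-saved.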

      It appears my configuration file was missing something that was added by the save:

      [2.1-RC0][root@pfsense2.test.example.org]/conf/backup(39): diff config-1369458202.xml /conf/config.xml
      827,829c827,829
      < <time>1369458202</time>
      <
      < <username>(system)</username>
      ---
      > <time>1369618717</time>
      >
      > <username>admin@192.168.211.241</username>
      840a841,842
      > <ipprotocol>inet</ipprotocol>
      > <interval></interval>
      842a845
      > <defaultgw></defaultgw>
      [2.1-RC0][root@pfsense2.test.example.org]/conf/backup(40):
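
A comparison like the diff above can also be done per element rather than per line. The following is a minimal sketch, assuming the gateway entry is a single XML element with simple child tags. The `<gateway_item>` wrapper and the `<name>` element are assumptions made for the example; only `ipprotocol`, `interval` and `defaultgw` come from the diff.

```python
import xml.etree.ElementTree as ET


def added_child_tags(before_xml, after_xml):
    """Return child tag names present in `after_xml` but missing from `before_xml`."""
    before = {child.tag for child in ET.fromstring(before_xml)}
    return [child.tag for child in ET.fromstring(after_xml) if child.tag not in before]


# Hypothetical fragments: the <gateway_item> wrapper and <name> element are assumed.
before = "<gateway_item><name>GW_WAN</name></gateway_item>"
after = (
    "<gateway_item><name>GW_WAN</name>"
    "<ipprotocol>inet</ipprotocol><interval></interval><defaultgw></defaultgw>"
    "</gateway_item>"
)

print(added_child_tags(before, after))
```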

        torontob

        On 2.1-RC1 I have to reset the RRD graphs every few days to get either the WAN_PPPoE Quality or Traffic graphs back. This is really annoying now. I looked at your findings and they do not seem to be my issue, as I have the target DSL PPPoE gateway in my apinger.conf file. Here are my files:

        $ ls -l /var/db/rrd/
        total 15352
        -rw-r--r--  1 nobody  wheel   47608 Sep  8 11:16 PBXVPN_VPNV4-quality.rrd
        -rw-r--r--  1 nobody  wheel   47608 Sep  8 11:16 VPNINBOUND_VPNV4-quality.rrd
        -rw-r--r--  1 nobody  wheel   47608 Sep  8 11:16 WAN_PPPOE-quality.rrd
        -rw-r--r--  1 nobody  wheel  393080 Sep  7 00:13 ipsec-packets.rrd
        -rw-r--r--  1 nobody  wheel  393080 Sep  7 00:13 ipsec-traffic.rrd
        -rw-r--r--  1 nobody  wheel  393080 Sep  7 00:13 lan-packets.rrd
        -rw-r--r--  1 nobody  wheel  393080 Sep  7 00:13 lan-traffic.rrd
        -rw-r--r--  1 nobody  wheel  393080 Sep  7 00:13 opt1-packets.rrd
        -rw-r--r--  1 nobody  wheel  393080 Sep  7 00:13 opt1-traffic.rrd
        -rw-r--r--  1 nobody  wheel  393080 Sep  7 00:13 opt2-packets.rrd
        -rw-r--r--  1 nobody  wheel  393080 Sep  7 00:13 opt2-traffic.rrd
        -rw-r--r--  1 nobody  wheel  393080 Sep  7 00:13 opt3-packets.rrd
        -rw-r--r--  1 nobody  wheel  393080 Sep  7 00:13 opt3-traffic.rrd
        -rw-r--r--  1 nobody  wheel  393080 Sep  7 00:13 ovpns2-packets.rrd
        -rw-r--r--  1 nobody  wheel  393080 Sep  7 00:13 ovpns2-traffic.rrd
        -rw-r--r--  1 nobody  wheel   49632 Sep  7 00:13 ovpns2-vpnusers.rrd
        -rw-r--r--  1 nobody  wheel  588376 Sep  7 00:13 system-mbuf.rrd
        -rw-r--r--  1 nobody  wheel  735104 Sep  7 00:13 system-memory.rrd
        -rw-r--r--  1 nobody  wheel  245888 Sep  7 00:13 system-processor.rrd
        -rw-r--r--  1 nobody  wheel  245888 Sep  7 00:13 system-states.rrd
        -rw-r--r--  1 root    wheel    8799 Sep  7 00:13 updaterrd.sh
        -rw-r--r--  1 nobody  wheel  393080 Sep  7 00:13 wan-packets.rrd
        -rw-r--r--  1 nobody  wheel  393080 Sep  7 00:13 wan-traffic.rrd
        

        I think this is a serious issue with PPPoE? I have seen this on all boxes with RC1 since last month :(

        Any thoughts?

        Thanks,

          wallabybob

          Looking at the timestamps on your RRD files, I suspect RRD collection stopped around midnight on 7 Sep for a number of data sets, including LAN packets and LAN traffic. What happened around then? A system reboot due to power failure, causing file corruption?

          I presume the more current timestamps on your WAN related files are because you "reset" them. What do you do to "reset" them?
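
Reading the `ls -l` timestamps by eye can be automated. Below is a minimal sketch, assuming Python is available wherever the RRD directory (or a copy of it) can be read. The function name and the 10-minute threshold are illustrative choices, not pfSense settings.

```python
import os
import time


def stale_rrd_files(directory, max_age_seconds, now=None):
    """Return sorted names of .rrd files not modified within max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for name in os.listdir(directory):
        if not name.endswith(".rrd"):
            continue  # skip helper files such as updaterrd.sh
        age = now - os.path.getmtime(os.path.join(directory, name))
        if age > max_age_seconds:
            stale.append(name)
    return sorted(stale)


# Only runs where the directory exists; 600 seconds is an arbitrary threshold.
if os.path.isdir("/var/db/rrd"):
    print(stale_rrd_files("/var/db/rrd", 600))
```

Files that keep appearing in the result while their neighbours are updated every minute are the data sets whose collection has stopped.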

            torontob

            I have a 2GB CF card installed, so I am not sure why swap space ran out, and not sure if that is the cause of the RRD issues, but here are the logs:

            Sep 7 00:12:16	check_reload_status: updating dyndns WAN_PPPOE
            Sep 7 00:12:16	check_reload_status: Restarting ipsec tunnels
            Sep 7 00:12:16	check_reload_status: Restarting OpenVPN tunnels/interfaces
            Sep 7 00:12:16	check_reload_status: Reloading filter
            Sep 7 00:12:24	php: rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use WAN_PPPOE.
            Sep 7 00:12:24	php: rc.openvpn: OpenVPN: Resync server2 OVEPN-inbound
            Sep 7 00:12:24	php: rc.dyndns.update: phpDynDNS (rbpbxcam102.qbooksonline.net): No change in my IP address and/or 25 days has not passed. Not updating dynamic DNS entry.
            Sep 7 00:12:25	kernel: ovpns2: link state changed to DOWN
            Sep 7 00:12:25	php: rc.openvpn: OpenVPN: Resync client1 PBX-VPN
            Sep 7 00:12:26	kernel: ovpns2: link state changed to UP
            Sep 7 00:12:26	kernel: ovpnc1: link state changed to DOWN
            Sep 7 00:12:26	check_reload_status: rc.newwanip starting ovpns2
            Sep 7 00:12:28	php: rc.filter_configure_sync: Could not find IPv4 gateway for interface (opt2).
            Sep 7 00:12:30	kernel: ovpnc1: link state changed to UP
            Sep 7 00:12:30	check_reload_status: rc.newwanip starting ovpnc1
            Sep 7 00:12:35	kernel: pid 80615 (php), uid 0, was killed: out of swap space
            Sep 7 00:12:36	php: rc.newwanip: rc.newwanip: Informational is starting ovpns2.
            Sep 7 00:12:36	php: rc.newwanip: rc.newwanip: on (IP address: 172.16.50.1) (interface: opt3) (real interface: ovpns2).
            Sep 7 00:12:38	kernel: pid 91802 (php), uid 0, was killed: out of swap space
            Sep 7 00:12:40	kernel: pid 43574 (php), uid 0, was killed: out of swap space
            Sep 7 00:12:42	kernel: pid 93676 (php), uid 0, was killed: out of swap space
            Sep 7 00:12:42	php: rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use WAN_PPPOE.
            Sep 7 00:12:42	php: rc.openvpn: OpenVPN: Resync server2 OVEPN-inbound
            Sep 7 00:12:43	php: rc.dyndns.update: phpDynDNS (rbpbxcam102.qbooksonline.net): No change in my IP address and/or 25 days has not passed. Not updating dynamic DNS entry.
            Sep 7 00:12:44	kernel: ovpns2: link state changed to DOWN
            Sep 7 00:12:44	check_reload_status: Reloading filter
            Sep 7 00:12:44	php: rc.openvpn: OpenVPN: Resync client1 PBX-VPN
            Sep 7 00:12:45	kernel: ovpns2: link state changed to UP
            Sep 7 00:12:45	php: rc.newwanip: rc.newwanip: Informational is starting ovpnc1.
            Sep 7 00:12:45	php: rc.newwanip: rc.newwanip: on (IP address: 172.20.20.10) (interface: opt2) (real interface: ovpnc1).
            Sep 7 00:12:45	kernel: ovpnc1: link state changed to DOWN
            Sep 7 00:12:45	check_reload_status: rc.newwanip starting ovpns2
            Sep 7 00:12:49	kernel: ovpnc1: link state changed to UP
            Sep 7 00:12:49	check_reload_status: rc.newwanip starting ovpnc1
            Sep 7 00:12:50	php: rc.newwanip: Creating rrd update script
            Sep 7 00:12:52	php: rc.newwanip: rc.newwanip: Informational is starting ovpns2.
            Sep 7 00:12:52	php: rc.newwanip: rc.newwanip: on (IP address: 172.16.50.1) (interface: opt3) (real interface: ovpns2).
            Sep 7 00:12:53	php: rc.newwanip: pfSense package system has detected an ip change 172.20.20.10 -> 172.20.20.10 ... Restarting packages.
            Sep 7 00:12:53	check_reload_status: Starting packages
            Sep 7 00:12:58	php: rc.newwanip: rc.newwanip: Informational is starting ovpnc1.
            Sep 7 00:12:58	php: rc.newwanip: rc.newwanip: on (IP address: 172.20.20.10) (interface: opt2) (real interface: ovpnc1).
            Sep 7 00:12:58	check_reload_status: updating dyndns PBXVPN_VPNV4
            Sep 7 00:12:58	check_reload_status: Restarting ipsec tunnels
            Sep 7 00:12:58	check_reload_status: Restarting OpenVPN tunnels/interfaces
            Sep 7 00:12:59	php: rc.newwanip: Creating rrd update script
            Sep 7 00:13:01	kernel: pid 3336 (php), uid 0, was killed: out of swap space
            Sep 7 00:13:02	php: rc.newwanip: pfSense package system has detected an ip change 172.16.50.1 -> 172.16.50.1 ... Restarting packages.
            Sep 7 00:13:04	php: rc.newwanip: Creating rrd update script
            Sep 7 00:13:06	kernel: pid 31988 (php), uid 0, was killed: out of swap space
            Sep 7 00:13:07	php: rc.newwanip: pfSense package system has detected an ip change 172.20.20.10 -> 172.20.20.10 ... Restarting packages.
            Sep 7 00:13:11	kernel: pid 34761 (php), uid 0, was killed: out of swap space
            Sep 7 00:13:13	kernel: pid 47671 (php), uid 0, was killed: out of swap space
            Sep 7 00:13:15	php: rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use PBXVPN_VPNV4.
            Sep 7 00:13:21	php: rc.start_packages: Restarting/Starting all packages.
            Sep 7 00:13:21	kernel: pid 65238 (php), uid 0, was killed: out of swap space
            Sep 7 00:13:24	php: rc.start_packages: Restarting/Starting all packages.
            Sep 7 00:13:24	php: rc.start_packages: The command '/usr/local/etc/rc.d/darkstat.sh stop' returned exit code '1', the output was 'No matching processes were found'
            Sep 7 00:13:27	php: rc.start_packages: You should specify an interface for bandwidthd to listen on. Exiting.
            

            Also, I reset by browsing to the RRD Settings page and resetting the data there.

            Thanks

              kejianshi

              When mine stopped, I deleted RRD data and it started working fine again  (Probably not the fix you were hoping for?)

                torontob

                @kejianshi:

                When mine stopped, I deleted RRD data and it started working fine again  (Probably not the fix you were hoping for?)

                Thanks - Is there a way to safely delete the data other than RESET?
                This is annoying. I appreciate any other feedback.

                  wallabybob

                  @torontob:

                  I have a 2GB CF card installed, so I am not sure why swap space ran out, and not sure if that is the cause of the RRD issues

                  When the kernel goes killing "random" processes because there is no swap space, file corruption is quite likely. I suspect you will probably need to get the swap space issue fixed to resolve this, but the feedback so far suggests "not until 2.2".

                  Does the offending system run the nano-BSD variant?

                    torontob

                    Sorry, what do you mean by "not until 2.2"? Is this a known issue?
                    I downloaded a snapshot of 2.1-RC1 and installed it. No changes to the base operating system.

                    Thanks,

                      doktornotor Banned

                      @torontob:

                      I have a 2GB CF card installed, so I am not sure why swap space ran out, and not sure if that is the cause of the RRD issues, but here are the logs:

                      There is no swap used on nanobsd. You are simply running out of RAM => trying to do too much on inadequate HW.

                        wallabybob

                        @torontob:

                        Sorry, what do you mean by "not until 2.2"? Is this a known issue?

                        See the discussion in http://forum.pfsense.org/index.php/topic,66188.0.html; in particular, see reply #2, in which ermal says:

                        There is a plan hopefully to be ready for 2.2 to make this a non-issue.

                        I don't know that fixing the "out of swap space" issue will fix your RRD issue but I do think it is likely that your RRD issue is caused by the kernel killing processes (due to running out of swap space) and that leaving one or more RRD data files "corrupt".

                          torontob

                          @doktornotor:

                          @torontob:

                          I have a 2GB CF card installed, so I am not sure why swap space ran out, and not sure if that is the cause of the RRD issues, but here are the logs:

                          There is no swap used on nanobsd. You are simply running out of RAM => trying to do too much on inadequate HW.

                          Guys, I am using THE standard hardware - Alix2D13 - and I have only 30 SIP phones on pfSense. I have put much much much more on this hardware and it hasn't run out of RAM ever. I would like to clarify two things though and I think we are getting closer:

                          I see that it's not the swap issue itself (wallabybob) that is causing the RRD graph issue, but the killing of processes because of it may be. So, what is my next step in diagnosing?

                            doktornotor Banned

                            @torontob:

                            I see that it's not the swap issue itself (wallabybob) that is causing the RRD graph issue, but the killing of processes because of it may be. So, what is my next step in diagnosing?

                            Dude, that message tells you that you have run out of RAM. Period.

                              phil.davis

                              Sep 7 00:12:38	kernel: pid 91802 (php), uid 0, was killed: out of swap space
                              Sep 7 00:12:40	kernel: pid 43574 (php), uid 0, was killed: out of swap space
                              Sep 7 00:12:42	kernel: pid 93676 (php), uid 0, was killed: out of swap space
                              

                              I just want to clarify to make sure it is understood. On nanoBSD there is no swap partition or file, i.e. swap space is zero. Therefore, when it says "out of swap space" it really means "I needed some swap space, but there is none" = "out of real memory".
                              Some people (including me with some configs) have been having this issue on 2.1 with 256MB RAM systems. It seems to be a transient thing, for me particularly when WAN links go up/down and a few OpenVPN links reestablish at pretty much the same time. A lot of processes fire up, initialising/restarting various things, and for those few seconds there is not enough real memory for it all. With the new Services Watchdog package, things normally recover, because after a minute Services Watchdog wakes up and restarts important things that got killed.
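
To see how often this happens and which programs are hit, the kill messages can be counted from a saved copy of the system log. This is a rough sketch; the regular expression is written against the message format quoted in this thread and may need adjusting for other FreeBSD versions.

```python
import re
from collections import Counter

# Written against the kernel message format quoted in this thread.
KILL_RE = re.compile(r"kernel: pid (\d+) \((\S+?)\), uid \d+, was killed: out of swap space")


def killed_processes(log_text):
    """Count 'out of swap space' kills per process name."""
    return Counter(match.group(2) for match in KILL_RE.finditer(log_text))


# Lines taken from the log posted earlier in the thread.
log = """\
Sep 7 00:12:36 php: rc.newwanip: rc.newwanip: Informational is starting ovpns2.
Sep 7 00:12:38 kernel: pid 91802 (php), uid 0, was killed: out of swap space
Sep 7 00:12:40 kernel: pid 43574 (php), uid 0, was killed: out of swap space
Sep 7 00:12:42 kernel: pid 93676 (php), uid 0, was killed: out of swap space
"""

print(killed_processes(log))
```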

                              As the Greek philosopher Isosceles used to say, "There are 3 sides to every triangle."
                              If I helped you, then help someone else - buy someone a gift from the INF catalog http://secure.inf.org/gifts/usd/

                                torontob

                                @phil.davis:

                                Sep 7 00:12:38	kernel: pid 91802 (php), uid 0, was killed: out of swap space
                                Sep 7 00:12:40	kernel: pid 43574 (php), uid 0, was killed: out of swap space
                                Sep 7 00:12:42	kernel: pid 93676 (php), uid 0, was killed: out of swap space
                                

                                I just want to clarify to make sure it is understood. On nanoBSD there is no swap partition or file, i.e. swap space is zero. Therefore when it says "out of swap space" it really means "I needed some swap space, but there is none" =  "out of real memory".
                                Some people (including me with some configs) have been having this issue on 2.1 with 256MB RAM systems. It seems to be a transient thing, for me particularly when WAN links go up/down and a few OpenVPN links reestablish at pretty much the same time. A lot of processes fire up initialising/restarting various things and for those few seconds there is not enough real memory for it all. By using the new System Watchdog package, things normally recover because after a minute System Watchdog wakes up and restarts important things that got killed.

                                Thanks for the clarification. What I meant is that I have not made any modifications that would cause swap issues. I do have a couple of OpenVPN links and do use DSL. So, if that becomes a problem on 2.1-RC1 when it is not a problem on 2.0 (which it is not for my other boxes), then it should be treated as a bug. Maybe tune everything down or add a delay for services to start?

                                Also, I see Services Watchdog - not System Watchdog - is that what you were referring to? Never used it though. Is that a stable package? (shows beta). I am wondering if that in itself creates more overhead sending me into a vicious circle :)

                                  wallabybob

                                  @torontob:

                                  Also, I see Services Watchdog - not System Watchdog - is that what you were referring to? Never used it though. Is that a stable package? (shows beta). I am wondering if that in itself creates more overhead sending me into a vicious circle :)

                                  And I can't see how Services Watchdog will help if RRD data is getting corrupted because processes are killed due to swap space exhaustion.

                                    phil.davis

                                    I meant Services Watchdog - I fixed the text in the previous post.
                                    Services Watchdog works well for what it does, a simple check every minute of what is running or missing. Of course, it sometimes happens to run at the same time as some down/up event on WAN. So then it can try to start things that are already in the process of starting by the link up/down scripts, and also use memory itself when memory is short. Services Watchdog is really intended to restart services that should never go missing but genuinely happen to crash (program crashes, divide by zero…). To recover nicely from service exits due to low memory it needs more parameters so it can wait a bit longer if the service is down, then only start it if it has been missing for some time - avoiding race conditions when other code is in the process of getting a WAN link into the up state again.
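
The "wait a bit longer, then only start it if it has been missing for some time" idea can be sketched as follows. This is a hypothetical illustration of the proposed behaviour, not how the Services Watchdog package is implemented; the class name and the 120-second grace period are assumptions.

```python
class DebouncedWatchdog:
    """Restart a service only after it has been seen down for `grace` seconds."""

    def __init__(self, grace):
        self.grace = grace
        self.down_since = {}  # service name -> time it was first seen down

    def check(self, service, running, now):
        """Return True when the caller should restart `service`."""
        if running:
            # Service is fine (or came back on its own): forget any earlier outage.
            self.down_since.pop(service, None)
            return False
        first_seen = self.down_since.setdefault(service, now)
        if now - first_seen >= self.grace:
            # Restart is due; begin a fresh grace period for the new attempt.
            self.down_since[service] = now
            return True
        return False


dog = DebouncedWatchdog(grace=120)
print(dog.check("openvpn", running=False, now=0))    # first sighting: wait
print(dog.check("openvpn", running=False, now=60))   # still inside the grace period
print(dog.check("openvpn", running=False, now=120))  # missing long enough: restart
```

The grace period gives the link up/down scripts time to finish starting the service themselves, which is the race described above.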

                                    We have made code to "slow down" the boot process also - https://github.com/pfsense/pfsense/pull/798 - this helps ensure that critical things start OK during the boot. As Ermal comments, it is a "kludge" - using a heuristic that waits for free memory to stabilize in order to infer that the thing just started has initialised and stabilized, and it is time to move on in the bootup process. This "kludge" does nothing for memory problems due to WAN down/up etc after the system is booted. But for us, at least it gets us a functional system at boot with site-to-site OpenVPN. That means we can login remotely and are not stuck with doing telephone support to someone in a remote office after a power failure.
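
As a rough illustration of a "wait for free memory to stabilize" heuristic (a sketch under assumed parameters, not the code in the pull request), one can look for a run of consecutive readings that barely change:

```python
def stable_index(samples, tolerance, window):
    """Return the index of the sample that completes `window` consecutive
    changes of at most `tolerance`, or None if the series never settles."""
    steady = 0
    for i in range(1, len(samples)):
        if abs(samples[i] - samples[i - 1]) <= tolerance:
            steady += 1
            if steady >= window:
                return i
        else:
            steady = 0  # a large jump restarts the count
    return None


# Hypothetical free-memory readings in MB, taken once per second after starting a service.
readings = [120, 96, 80, 71, 70, 70, 69, 69]
print(stable_index(readings, tolerance=2, window=3))
```

A boot script using this idea would keep sampling until the function returns an index, then move on to the next service.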

                                    My Alix 256MB systems typically run with memory usage on the dashboard at 45% to 55%. As long as all the physical links and OpenVPN links are working nicely, nothing restarts and there is plenty of memory. So these Alix systems can run with configs such as mine: 2 WANs, 3 LANs (on VLAN switch), OpenVPN server for site-to-site links with 9 client offices connecting, OpenVPN server for Road Warrior with a handful of client connections, DHCP, DNS, and only the Cron, Sudo, Service Watchdog and OpenVPN Client Export packages. It just needs some serialization of real-time event handling so that the free memory is used in a nicely controlled manner when recovering from events.

                                      phil.davis

                                      @wallabybob:

                                      @torontob:

                                      Also, I see Services Watchdog - not System Watchdog - is that what you were referring to? Never used it though. Is that a stable package? (shows beta). I am wondering if that in itself creates more overhead sending me into a vicious circle :)

                                      And I can't see how Services Watchdog will help if RRD data is getting corrupted because processes are killed due to swap space exhaustion.

                                      Correct - if the only problem is the original post about corrupted RRD data, then that can happen whenever the "killed" happens randomly during the RRD data writing process, leaving part-updated corrupt data files. To fix that, the system has to be fixed so that "killed" never happens.
                                      Services Watchdog or other recovery methods are only useful to get processes running again, as long as their associated data files are in good shape.

                                        torontob

                                        Thanks wallabybob for input.
                                        Since the topic changed a bit, I might as well ask this question. From time to time, I see that OpenVPN is not trying to re-establish a connection. Why is that? Why is there a need for something like Watchdog, and why won't vital services just restart automatically?

                                        Thanks,

                                          phil.davis

                                          If a critical service process (dhcp, dns, OpenVPNs…) exits abnormally (system problem like out of swap space, program problem like divide by zero), there is no controlling script that called it, receives the error code, and can loop around to try to run the program again. These services are forked off into independent processes by the bootup scripts.
                                          There is nothing else built into the system that monitors AND restarts them automatically. Yes, there are dashboard displays, but they don't take automatic action. And in any case, what if nobody has a dashboard running?
                                          So JimP has kindly made Service Watchdog - in a perfect world it would not be needed, just like a real watchdog.
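
For contrast, the kind of controlling loop that the post says is missing would look roughly like this. It is a generic supervisor sketch, not something pfSense ships and not how Service Watchdog works (which, per the posts above, checks once a minute); the function name and parameters are illustrative.

```python
import subprocess
import sys
import time


def supervise(command, max_restarts, delay=1.0, run=subprocess.call, sleep=time.sleep):
    """Run `command`, rerunning it after each non-zero exit.

    Returns the number of restarts performed. Stops when the command exits
    with status 0 or when `max_restarts` has been reached.
    """
    restarts = 0
    while True:
        status = run(command)  # blocks until the child exits; negative if killed by a signal
        if status == 0 or restarts >= max_restarts:
            return restarts
        restarts += 1
        sleep(delay)  # back off so a crash loop does not consume the machine


# Demo: a child process that always fails is restarted twice, then given up on.
failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
print(supervise(failing, max_restarts=2, delay=0.1))
```

Because the supervisor is the parent of the service, it sees the exit status directly, which is exactly what the forked-off services on the firewall lack.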

                                          OpenVPN: in my experience the OpenVPN code is very good at trying forever to connect and eventually connecting once the underlying physical links and internet is up and working. I expect that an OpenVPN client would only stop trying to establish a connection if the process has actually crashed, which I have only seen happen because of "killed: out of swap space".
