Netgate Discussion Forum

    Status: Monitoring is completely broken, pfSense 2.4.5

    Category: webGUI · Tag: monitoring
    46 Posts 8 Posters 7.1k Views
    • scurrier

      Figured out where the problematic value of 7200 was coming from. It comes from the RRD file itself, when the file is queried on line 168 of rrd_fetch_json.php by the rrd_fetch() function, using the options derived from the POST data I attached in a picture above. I constructed the rrdtool fetch command that this POST data should produce and ran it on the command line directly against the file:

      me@my-machine:~/pfsense$ rrdtool fetch rrd/WAN_DHCP-quality.rrd AVERAGE -r 3600 -s now-1m+1hour -e 1595460745-1hour
                                 loss               delay              stddev
      
      1592978400: 0.0000000000e+00 1.5691636408e-02 6.2196527951e-03
      1592985600: 0.0000000000e+00 9.1416309671e-03 2.7422781960e-03
      1592992800: 0.0000000000e+00 8.8436429234e-03 2.5432010581e-03
      1593000000: 2.6234902083e-02 9.2394222539e-03 3.5694343962e-03
      1593007200: 5.2510052500e-02 1.0458579000e-02 4.6231059930e-03
      1593014400: 0.0000000000e+00 1.0825514267e-02 4.7963684750e-03
      <snip>
      1595419200: 0.0000000000e+00 8.0070629420e-03 1.7023876495e-03
      1595426400: 0.0000000000e+00 8.5285815145e-03 2.3826874231e-03
      1595433600: 7.1014380729e+00 8.6965289475e-03 2.7509416831e-03
      1595440800: 4.1289524583e-02 8.8894945167e-03 2.6180607746e-03
      1595448000: 6.1264454861e-02 8.7516545776e-03 2.4379814676e-03
      1595455200: 0.0000000000e+00 8.6976092202e-03 2.6467615867e-03
      1595462400: -nan -nan -nan
      

      The values to the left of the colons are Unix timestamps (seconds since the epoch). If you take the difference between consecutive ones, you'll see it's 7200. I believe rrd_fetch() uses that spacing as the step property of the result, which is used on line 174. Later, this value is referenced as data[0].step on line 1139 of status_monitoring.php, as shown in my post from 2 days ago, and the problem occurs when timeLookup has no matching key for it.
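To make the step arithmetic concrete, here is a minimal, hypothetical JavaScript sketch that parses a few lines of the `rrdtool fetch` output above, takes the spacing between consecutive timestamps as the step, and looks that step up in a format table. The `timeLookup` keys and format strings are placeholders invented for illustration, not the actual table in status_monitoring.php; `parseFetchOutput` and `inferStep` are hypothetical helper names.

```javascript
// Hypothetical sketch: infer the row spacing ("step") from `rrdtool fetch`
// text output and look it up in a format table. The timeLookup keys and
// format strings are placeholders, not the real table in status_monitoring.php.

// Parse data lines such as "1592978400: 0.0e+00 1.5e-02 6.2e-03" into
// { time, values } objects; the column header and blank lines are skipped.
function parseFetchOutput(text) {
  const rows = [];
  for (const line of text.split("\n")) {
    const match = line.match(/^\s*(\d+):\s+(.*)$/);
    if (!match) continue;
    // Number("-nan") is NaN, so empty rows come through as NaN values.
    const values = match[2].trim().split(/\s+/).map(Number);
    rows.push({ time: Number(match[1]), values });
  }
  return rows;
}

// The step is the spacing between consecutive timestamps, in seconds.
function inferStep(rows) {
  return rows.length < 2 ? undefined : rows[1].time - rows[0].time;
}

// Placeholder lookup table: note that it has no 7200 entry.
const timeLookup = { 60: "%H:%M", 300: "%H:%M", 3600: "%m/%d %H:%M", 86400: "%Y-%m-%d" };

// First two data rows and the trailing empty row from the fetch output.
const sample = [
  "                           loss               delay              stddev",
  "",
  "1592978400: 0.0000000000e+00 1.5691636408e-02 6.2196527951e-03",
  "1592985600: 0.0000000000e+00 9.1416309671e-03 2.7422781960e-03",
  "1595462400: -nan -nan -nan",
].join("\n");

const rows = parseFetchOutput(sample);
const step = inferStep(rows);
console.log(step);             // 7200
console.log(timeLookup[step]); // undefined, so no time format is found
```

The undefined lookup result is the failure mode described above: any step without a key yields no format.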

      So, we have traced the problem all the way back to the RRD file itself. It looks like this step size was not anticipated, so it was not included in timeLookup. My firewall has been running for 6 years, so maybe that length of time has something to do with it? Maybe the resolution decreased as the file filled up? I don't know. The good news is that the fix appears to be as easy as adding a line to timeLookup to account for it. The alternative is diving really deep into RRDtool, or into the place where RRDtool is invoked to create the files, to figure out whether anything there could be causing it. I don't plan on doing that.
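A minimal sketch of the "add a line to timeLookup" fix, assuming timeLookup is a plain object keyed by step in seconds. Only the idea of adding a 7200-second entry comes from the post; the other keys and all format strings are hypothetical placeholders, and `formatForStep` is an illustrative helper name.

```javascript
// Hypothetical sketch of the one-line fix: give the 7200-second step its own
// entry. The other keys and all format strings are placeholders.
const timeLookup = {
  60: "%H:%M",
  300: "%H:%M",
  3600: "%m/%d %H:%M",
  7200: "%m/%d %H:%M", // added: 2-hour step returned by this long-lived RRD file
  86400: "%Y-%m-%d",
};

function formatForStep(step) {
  return timeLookup[step]; // undefined for any step that is still missing
}

console.log(formatForStep(7200)); // %m/%d %H:%M
```

This covers the 7200 case only; any other step that rrdtool might return would still come back undefined.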

      • scurrier

        The rrdtool fetch documentation even states that the resolution argument may not be honored, and that's what's happening here: we asked for a resolution of 3600 but got 7200.

        --resolution|-r resolution (default is the highest resolution)
        
            the interval you want the values to have (seconds per value). An optional suffix
        may be used (e.g. 5m instead of 300 seconds). rrdfetch will try to match your request,
        but it will return data even if no absolute match is possible.
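To illustrate how a requested resolution can be overridden, here is a deliberately simplified toy model in JavaScript. It is an assumption for illustration, not rrdtool's actual selection code: it prefers archives that cover the whole requested range and picks the one whose step is closest to the requested resolution, falling back to the archive that reaches back farthest when none covers the range. The archive list is hypothetical, and `pickArchive` is an illustrative name.

```javascript
// Toy model (an assumption, not rrdtool's source) of how a requested
// resolution can be overridden. Each archive has a step (seconds per row)
// and a row count, and reaches back step * rows seconds from `end`.
function pickArchive(archives, start, end, resolution) {
  let bestFull = null;    // best archive covering the whole [start, end] range
  let bestPartial = null; // otherwise: the archive that reaches back farthest
  let bestPartialOldest = Infinity;
  for (const archive of archives) {
    const oldest = end - archive.step * archive.rows;
    if (oldest <= start) {
      // Among covering archives, prefer the step closest to the request.
      if (bestFull === null ||
          Math.abs(archive.step - resolution) < Math.abs(bestFull.step - resolution)) {
        bestFull = archive;
      }
    } else if (oldest < bestPartialOldest) {
      bestPartial = archive;
      bestPartialOldest = oldest;
    }
  }
  return bestFull ?? bestPartial;
}

// Hypothetical file: the 1-hour archive holds less than a month of rows.
const archives = [
  { step: 3600, rows: 500 },
  { step: 7200, rows: 1000 },
  { step: 86400, rows: 2284 },
];
const end = 1595460745;
const oneMonthAgo = end - 30 * 86400;
console.log(pickArchive(archives, oneMonthAgo, end, 3600).step); // 7200
```

In this model, asking for 3600 over a month returns the 7200 archive simply because the finer one does not reach back far enough.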
        
          • serbus

          Hello!

          Have you tried resetting your rrd data using the "Reset Data" button in Status -> Monitoring -> Settings, or with "/bin/rm /var/db/rrd/*" ?

          John

          Lex parsimoniae

            • scurrier @serbus

            @serbus No, because I want to keep my data.

              • serbus @scurrier

              @scurrier said in Status: Monitoring is completely broken, pfSense 2.4.5:

              A long time ago I might have tried changing the RRD settings to retain more data points or something. Not sure, hard to remember.

              Hello!

              You could try backing up the /var/db/rrd folder and then resetting.
              Or just manually remove the rrd file for the interface/dataset that is giving you problems.
              Retaining the data may not be worth it if you can't display it the way you want to, but maybe there is an easy code workaround.

              John


                • scurrier @serbus

                @serbus I think you're right that it will probably fix it. I'm not going to do it now, but may decide to try tomorrow. Regarding possibly having changed the RRD settings, is there even a place to do that in the GUI? I'm not the kind to go messing around under the hood.

                  • serbus

                  Hello!

                  I don't know if there are RRD tweaks in the GUI.

                  There is an RRD Data option in Diagnostics -> Backup & Restore that could simplify saving and restoring your data if you want to experiment with it.

                  John


                    • Gertjan

                    RRD files can be modified.

                    No GUI, as you're dealing with pure data chunks.
                    pfSense has the tool.

                    rrdtool dump /var/db/rrd/lan-traffic.rrd /root/lan-traffic.xml
                    

                    Now edit this xml file using your favorite editor.
                    When done:

                    rrdtool restore -f /root/lan-traffic.xml /var/db/rrd/lan-traffic.rrd
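Before editing a dump by hand, it can help to see which archives the file actually holds. Below is a regex-based JavaScript sketch (not a real XML parser) that summarizes the archives in `rrdtool dump` output. It assumes the usual dump layout: a top-level `<step>`, then `<rra>` elements containing `<cf>`, `<pdp_per_row>` and `<row>` entries. `summarizeDump` is a hypothetical helper name, and the sample XML is hand-made, not taken from a real pfSense file.

```javascript
// Regex-based sketch (not a general XML parser) that summarizes the archives
// in `rrdtool dump` output. It assumes the usual dump layout: a top-level
// <step>, then <rra> elements with <cf>, <pdp_per_row> and <row> entries.
function summarizeDump(xml) {
  const stepMatch = xml.match(/<step>\s*(\d+)\s*<\/step>/);
  if (!stepMatch) throw new Error("no <step> element found");
  const baseStep = Number(stepMatch[1]); // seconds per primary data point

  const archives = [];
  for (const [, body] of xml.matchAll(/<rra>([\s\S]*?)<\/rra>/g)) {
    const cf = (body.match(/<cf>\s*(\w+)\s*<\/cf>/) || [])[1];
    const pdpPerRow = Number((body.match(/<pdp_per_row>\s*(\d+)\s*<\/pdp_per_row>/) || [])[1]);
    const rows = (body.match(/<row>/g) || []).length; // rows stored in this archive
    const secondsPerRow = baseStep * pdpPerRow;
    archives.push({ cf, secondsPerRow, rows, coverageDays: (secondsPerRow * rows) / 86400 });
  }
  return archives;
}

// Tiny hand-made example: one daily AVERAGE archive holding two rows.
const sampleXml = `
<rrd>
  <step>60</step>
  <rra>
    <cf>AVERAGE</cf>
    <pdp_per_row>1440</pdp_per_row>
    <database>
      <row><v>1.0</v></row>
      <row><v>2.0</v></row>
    </database>
  </rra>
</rrd>`;

console.log(summarizeDump(sampleXml));
```

In Node.js you could feed it the text of /root/lan-traffic.xml, read with `fs.readFileSync(path, "utf8")`.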
                    

                    No "help me" PM's please. Use the forum, the community will thank you.
                    Edit : and where are the logs ??

                      • bmeeks

                      It looks like the answer has been found. It can be considered a bug, but a very obscure one that requires unusual circumstances to trigger: namely a very, very large RRD dataset (the OP says his is 6 years old). It could very well be something RRD does internally once the dataset file reaches a certain size. Since that is an unusually large dataset, the other folks in the thread are unable to reproduce it with their likely smaller datasets.

                      @scurrier: what size is your rrd file? Have you been running on the same hardware the entire 6 years? Just wondering if you do in fact have 6 years worth of data in a single contiguous file.

                      @scurrier: take the info you have collected, and the solution you found, and submit an official bug report on the pfSense Redmine site here: https://redmine.pfsense.org/. That will put it on the developers' plate for future work. If you have already submitted a bug report, please edit it if necessary and include all the information you collected in your posts above. That will be of great value to whoever works on the bug report.

                        • johnpoz LAYER 8 Global Moderator

                        6 years is a lot of data, given what type of data it is. Does it really make sense to keep it for that long?

                        I just looked and mine goes back to Dec 2017, which I assume is when I fired up this 4860. But when moving to new hardware I wouldn't bring that data over.

                        It's great info for sure, but I doubt the bug report would get much attention until someone is sitting around twiddling their thumbs - hmm, hmm, what to work on ;)

                        I would think a quick fix would be to truncate the RRD data at some limit X, so it only ever goes back so far, or keeps only so many data points.
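One way to read the truncation idea is on the fetch/display side rather than by rewriting the .rrd file. The JavaScript sketch below assumes that reading: it clamps the requested start time to a maximum lookback, or keeps only the newest N points. `clampStart`, `keepNewest`, `maxLookbackSeconds` and `maxPoints` are hypothetical names, not existing pfSense settings, and whether a shorter range avoids the 7200-second step depends on which archives the file holds.

```javascript
// Hypothetical sketch of "only ever go back so far", applied when fetching or
// drawing rather than by rewriting the .rrd file. maxLookbackSeconds and
// maxPoints are illustrative knobs, not existing pfSense settings.
function clampStart(start, end, maxLookbackSeconds) {
  // Never request data older than end - maxLookbackSeconds.
  return Math.max(start, end - maxLookbackSeconds);
}

function keepNewest(rows, maxPoints) {
  // Drop the oldest rows beyond the limit; rows are assumed sorted oldest first.
  return rows.length > maxPoints ? rows.slice(rows.length - maxPoints) : rows;
}

const end = 1595460745;
const fiveYears = 5 * 365 * 86400;
console.log(clampStart(end - 6 * 365 * 86400, end, fiveYears) === end - fiveYears); // true
console.log(keepNewest([1, 2, 3, 4, 5], 3)); // [ 3, 4, 5 ]
```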

                        But yeah @scurrier great work on tracking it down..

                        An intelligent man is sometimes forced to be drunk to spend time with his fools
                        If you get confused: Listen to the Music Play
                        Please don't Chat/PM me for help, unless mod related
                        SG-4860 24.11 | Lab VMs 2.7.2, 24.11

                          • scurrier

                          There's a quick and easy version of a fix so maybe I will try to figure out which git branch to work from and submit a pull request on it that mimics the open heart surgery I did in the browser.

                            • scurrier

                            Looks like I've been running this firewall since the end of 2014.
                            [Screenshot]

                            Interestingly, the data for some metrics doesn't go back as far as others. Compare these two. Note that the first one shows nulls for part of the data even though I know the firewall was running at that time. I wonder why.
                            [Screenshot 1]

                            [Screenshot 2]

                              • Gertjan @scurrier

                              @scurrier said in Status: Monitoring is completely broken, pfSense 2.4.5:

                              I wonder why.

                              Have a look at the folder where the rrd files are stored. Maybe you used another type of WAN interface before, such as PPPoE or a static setup. The rrd file for that period would have a different name.

                              Btw: I didn't know that so much data is stored in an rrd file:

                              [Screenshot]

                              Mine goes back to 2014 also.
                              The file from before that time was for a PPPoE connection; I wiped that stale rrd file a long time ago.

                              Edit:
                              IPv6 since 2014 also:

                              [Screenshot]

                              The big chunk was the period when I tried to sync my Syno NAS with some Microsoft Office cloud overnight. I'll try that again when fiber gets invented.


                                • kiokoman LAYER 8

                                Strange, but I don't have those lines in my PHP,
                                the ones you mentioned here:
                                https://forum.netgate.com/post/925667

                                [Screenshot]

                                maybe it's not up to date?

                                ̿' ̿'\̵͇̿̿\з=(◕_◕)=ε/̵͇̿̿/'̿'̿ ̿
                                Please do not use chat/PM to ask for help
                                we must focus on silencing this @guest character. we must make up lies and alter the copyrights !
                                Don't forget to Upvote with the 👍 button for any post you find to be helpful.

                                  • serbus

                                  Hello!

                                  Just winging it here... I don't have an rrd dataset to reproduce the problem or test any solutions...

                                  It looks like most (all?) RRDs are set up to hold 2284 days (6.25 yr) of data. Once that limit is hit, the consolidation strategy must push the resolution out to 7200 sec (?).

                                  You could always increase the maximum. Is 10 years enough? Maybe 12 years? There would need to be special update scripts to "resize GROW" all the current data. Yuk!
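The 6.25-year figure and the size of a GROW can be checked with a little arithmetic. This JavaScript sketch assumes the longest archive uses a 60-second base step with 1440 data points consolidated per row (one row per day); that layout is an assumption, so check your own file's `step`, `rra[N].pdp_per_row` and `rra[N].rows` with `rrdtool info`. The row count it computes is what you would pass to `rrdtool resize <file> <rra-num> GROW <rows>`, which per its manual writes the result to resize.rrd in the current directory. `coverageYears` and `rowsToGrow` are hypothetical helper names.

```javascript
// Worked arithmetic for the "2284 days (6.25 yr)" figure and for sizing a
// `rrdtool resize ... GROW`. ASSUMPTION: the longest archive uses a 60-second
// base step with 1440 data points consolidated per row, i.e. one row per day.
const SECONDS_PER_DAY = 86400;
const DAYS_PER_YEAR = 365.25;

// How many years an archive spans: seconds per row times rows, in years.
function coverageYears(baseStep, pdpPerRow, rows) {
  return (baseStep * pdpPerRow * rows) / SECONDS_PER_DAY / DAYS_PER_YEAR;
}

// Extra rows needed for the archive to span targetYears (0 if it already does).
function rowsToGrow(baseStep, pdpPerRow, rows, targetYears) {
  const secondsPerRow = baseStep * pdpPerRow;
  const needed = Math.ceil((targetYears * DAYS_PER_YEAR * SECONDS_PER_DAY) / secondsPerRow);
  return Math.max(0, needed - rows);
}

console.log(coverageYears(60, 1440, 2284).toFixed(2)); // 6.25
console.log(rowsToGrow(60, 1440, 2284, 10));           // 1369
console.log(rowsToGrow(60, 1440, 2284, 12));           // 2099
```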

                                  Maybe just add every possible resolution to timeLookup? But what if rrd starts using one that wasn't added?
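One way around enumerating every resolution is to fall back to the nearest known step. This JavaScript sketch uses the format of the largest known step that does not exceed the actual step, and the finest format when the step is smaller than every key. The table contents are hypothetical placeholders, and `formatForStep` is an illustrative name, not code from pfSense.

```javascript
// Hypothetical sketch: instead of listing every step rrdtool might return,
// use the format of the largest known step that does not exceed the actual
// step. Keys and format strings are placeholders, not pfSense's real table.
const timeLookup = { 60: "%H:%M", 300: "%H:%M", 3600: "%m/%d %H:%M", 86400: "%Y-%m-%d" };

function formatForStep(step, lookup = timeLookup, fallback = "%Y-%m-%d") {
  const keys = Object.keys(lookup).map(Number).sort((a, b) => a - b);
  if (keys.length === 0) return fallback; // empty table: plain date format
  let chosen = keys[0]; // step finer than every key: use the finest format
  for (const key of keys) {
    if (key <= step) chosen = key; // keys ascend, so this ends at the largest key <= step
  }
  return lookup[chosen];
}

console.log(formatForStep(3600));   // %m/%d %H:%M (exact match)
console.log(formatForStep(7200));   // %m/%d %H:%M (borrows the 3600 entry)
console.log(formatForStep(172800)); // %Y-%m-%d (borrows the 86400 entry)
```

The trade-off is that an in-between step borrows a finer format than strictly needed; a fixed default is simpler but loses the time of day for every unknown step.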

                                  The simplest solution might be to convert the timeFormat assignment to a ternary expression, like:

                                  // Fall back to a plain date format when the step has no entry in timeLookup
                                  var timeFormat = timeLookup.hasOwnProperty(data[0].step) ? timeLookup[data[0].step] : "%Y-%m-%d";
                                  

                                  Maybe you could backup/zip/post your rrd data set so others could load/test possible solutions?

                                  It is pretty cool that pfsense has the durability/longevity to run into this sort of problem :).

                                  John


                                  Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.