Bootup WAN RRD errors reported

phil.davis

APU with SD card and nanoBSD
2.2-BETA (amd64)
built on Wed Nov 26 15:27:11 CST 2014
FreeBSD 10.1-RELEASE

During bootup some errors are reported in the system log about:

ERROR: opening '/var/db/rrd/WAN_DHCP-quality.rrd': No such file or directory

When the system has booted, the file is there in /var/db/rrd and I can see WAN traffic, packets and quality data from previous times as well as current since upgrade/reboot. So it is probably a timing thing on nanoBSD - I suspect something is trying to use the file before it has been reloaded from the SD card into memory filesystem /var.

No idea if this message has been coming out for a long time - now that I looked in the system log for other reasons, I am noticing a few other little things that could do with investigating and cleaning up. It is nice if we can have no "normal" error messages.

Sample of system log around the time of the RRD error messages:

Nov 27 09:35:26 	kernel: done.
Nov 27 09:35:26 	php: rc.bootup: ROUTING: setting default route to 10.49.80.250
Nov 27 09:35:23 	check_reload_status: rc.newwanip starting ovpnc1
Nov 27 09:35:23 	kernel: ovpnc1: link state changed to UP
Nov 27 09:35:21 	check_reload_status: Linkup starting re2
Nov 27 09:35:21 	devd: Executing '/usr/local/sbin/pfSctl -c 'interface linkup start re2''
Nov 27 09:35:21 	kernel: re2: link state changed to UP
Nov 27 09:35:21 	kernel:
Nov 27 09:35:20 	kernel: done.
Nov 27 09:35:18 	php: rc.bootup: The command '/usr/bin/nice -n20 /usr/local/bin/rrdtool update /var/db/rrd/WAN_DHCP-quality.rrd N:U:U' returned exit code '1', the output was 'ERROR: opening '/var/db/rrd/WAN_DHCP-quality.rrd': No such file or directory'
Nov 27 09:35:18 	kernel: .done.
Nov 27 09:35:18 	php: rc.bootup: RRD create failed exited with 1, the error is: ERROR: creating '/var/db/rrd/WAN_DHCP-quality.rrd': No such file or directory
Nov 27 09:35:18 	kernel: ....
Nov 27 09:35:17 	kernel: pflog0: promiscuous mode enabled
Nov 27 09:35:17 	kernel: tun1: changing name to 'ovpnc1'
Nov 27 09:35:17 	kernel:
Nov 27 09:35:17 	php: rc.bootup: Resyncing OpenVPN instances.
Nov 27 09:35:17 	check_reload_status: Linkup starting re0
Nov 27 09:35:17 	devd: Executing '/usr/local/sbin/pfSctl -c 'interface linkup stop re0''
Nov 27 09:35:17 	kernel: re0: link state changed to DOWN
Nov 27 09:35:17 	check_reload_status: Linkup starting re2
Nov 27 09:35:17 	devd: Executing '/usr/local/sbin/pfSctl -c 'interface linkup stop re2''
Nov 27 09:35:17 	kernel: re2: link state changed to DOWN
Nov 27 09:35:17 	kernel:
Nov 27 09:35:17 	check_reload_status: rc.newwanip starting re1
Nov 27 09:35:15 	check_reload_status: Linkup starting re1
Nov 27 09:35:15 	devd: Executing '/usr/local/sbin/pfSctl -c 'interface linkup start re1''
Nov 27 09:35:15 	kernel: re1: link state changed to UP
Nov 27 09:35:14 	sshlockout[8123]: sshlockout/webConfigurator v3.0 starting up
Nov 27 09:35:14 	sshd[7866]: Server listening on 0.0.0.0 port 22.
Nov 27 09:35:14 	sshd[7866]: Server listening on :: port 22.
Nov 27 09:35:13 	check_reload_status: Linkup starting re1
Nov 27 09:35:13 	devd: Executing '/usr/local/sbin/pfSctl -c 'interface linkup stop re1''
Nov 27 09:35:13 	kernel: re1: link state changed to DOWN
Nov 27 09:35:13 	kernel:
Nov 27 09:35:13 	kernel: Trying to mount root from ufs:/dev/ufs/pfsense0 [ro,sync,noatime]...

cmb

I've been seeing that for quite some time on nano. I agree it sounds like the cause is along the lines of what you described. I opened a ticket.
https://redmine.pfsense.org/issues/4048

If you notice any other "normal error" logs, please go ahead and open tickets in redmine. If they're things we can easily and quickly fix, or you can submit a pull request, great. Otherwise those will probably get pushed off to post-2.2 release. We're on track to hit RC in a matter of days, and those are "perfect is the enemy of good" type things that can delay a release forever.

phil.davis

those are "perfect is the enemy of good" type things that can delay a release forever

I agree - but also it is ideal if there are no error or warning messages emitted anywhere under normal boot and operation.

I would really like to see an RC, because actually everything I use seems to be working nicely now. Bit selfish eh - I don't really care about bridges, MTUs, IPsec, gif, gre… - but I suppose others do ;) Once an RC comes out I would guess that we would be much more careful about just fixing things in a less-risky way (i.e. not making wholesale changes to underlying base code just to fix a little edge case...). An thus I get inclined to upgrade a production system/s where I actually work and close by.

I will have a look myself at this crud in the next day or so. If I can see an easy fix (might be just putting the new platform_booting() check in more/better places) I will make pull request/s.

phil.davis

It is a timing issue with trying to create the quality data file before setting up the /var/db/rrd folder.
This fixes it for me:
https://github.com/pfsense/pfsense/pull/1349

eri--

This is wrong since the /var/db is mounted during bootup or created.
Normally it should not be possible to have that run before bootup.

Its to understand what happens that makes that trigger before.

I pushed some fixes to make sure this was not triggered from an upgrade of config today.

Actually i just pushed a commit that should help with that.

phil.davis

Yes, what you did in https://github.com/pfsense/pfsense/commit/f72ea82a9bcdbcac6041e6df460f82517e762968 moves the creation of /var/db/rrd into both enable_rrd_graphing() and create_gateway_quality_rrd() so that whatever calls those things at whatever time, the dir can be created if it does not yet exist. So that should be even safer.

I had just moved the code up in gwlb.inc so that /var/db/rrd was created before entering the "foreach($gateways_arr…" loop, keeping the actual code chunk the same.

That chmod on tmp_path was already there - I just moved it along with all the dir creation code. I didn't like to remove it or mess with it, for fear of breaking something! (It did seem an odd bit of code - but it is hard to know when coming across these things if there is some real reason, or if it is actually just an error from the past.)

Maintenance of old non-obvious code can be a bit of a headache ::)

eri--

I pushed the fixes.
They will be on next snapshots please test with that.