Alix power resets report for 2.2.4-DEVELOPMENT
built on Sat Jul 11 02:49:04 CDT 2015
Alix 2D13 with VLANs, OpenVPN, DNS Resolver, Dynamic DNS, DHCP…
a) 5 times: power plug pulled while the system was running normally with partitions mounted RO - no problem.
b) 5 times: power plug pulled at various points in the boot sequence, while it is configuring stuff and partitions are mounted RW. On the next boot it said "WARNING: / was not properly dismounted" (true), ran some file system checks, which were fine, and booted successfully - good.
c) Make some config change on the webGUI and press save, pull the power 1 to 5 seconds after pressing save, and before the webGUI comes back to say the change is saved. After the next boot sometimes the change was there, sometimes not - which is fine, depending on how quickly the power was pulled. The boot with fsck was always fine and I never got a corrupt or empty config.
d) From the command line:
time /etc/rc.conf_mount_rw
echo "Stuff" > /cf/conf/z.txt
time /etc/rc.conf_mount_ro
Mount RW takes about 0.5 seconds.
Mount RO takes from 2.5 to 3.9 seconds.
Without the "echo" command in the middle, the mount RO still takes a similar 2.5 to 3.9 seconds. Writing one small file made no noticeable difference at this level of timing.
e) From the command line:
time /etc/rc.conf_mount_rw
echo "Stuff" > /cf/conf/z.txt
Pull the plug in the next second without re-mounting RO. z.txt was there with the expected content after the next boot - good.
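The timings in test d) come from a handful of manual runs. If someone wants to repeat them more systematically, something like the sketch below could collect the spread over several runs. It assumes Python is available wherever it is run, which is not a given on the Alix image; `time_command` and `summarize` are hypothetical helper names, and a no-op process stands in for the real mount scripts so the sketch runs anywhere.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=5):
    """Run cmd `runs` times and return the wall-clock duration of each run in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return (min, mean, max) of a list of durations."""
    return min(durations), statistics.mean(durations), max(durations)


if __name__ == "__main__":
    # On the Alix one would time ["/etc/rc.conf_mount_ro"] after a mount RW;
    # a no-op Python process is used here as a stand-in.
    low, mean, high = summarize(time_command([sys.executable, "-c", "pass"]))
    print(f"min {low:.2f}s  mean {mean:.2f}s  max {high:.2f}s")
```

Reporting min and max alongside the mean matches the "2.5 to 3.9 seconds" style of the numbers above.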
These sorts of tests would need to be run a few hundred times to really "know" that things are OK. But at least I am feeling better. From Redmine I see that @cmb is doing some controlled power cycling tests over hundreds of cycles, which should give real confidence.
When making changes from the webGUI I am experiencing a wait of 12 seconds before the page comes back with the config change saved. There are 2 to 4 seconds in the mount RO from the test above. I guess the rest (some 8 to 10 seconds) is somewhere in Alix processing and the number of writes involved in writing the backup config, deleting the oldest backup config and writing the new config…
If the system turns out to be safe from corruption due to power interruptions then 12 seconds on an old Alix/CF card is bearable for me. If there are lots of changes to be made in a short period we can always mount RW first, make all the changes, then mount RO at the end.
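The "mount RW first, make all the changes, then mount RO at the end" idea can be sketched as a small wrapper. This is only an illustration, assuming Python is available and that the two scripts from the tests above are called directly; `mounted_rw` is a hypothetical name, and the `run` parameter exists only so the sketch can be tried without touching a real system.

```python
import subprocess
from contextlib import contextmanager

# Script paths as used in the command-line tests above.
MOUNT_RW = "/etc/rc.conf_mount_rw"
MOUNT_RO = "/etc/rc.conf_mount_ro"


@contextmanager
def mounted_rw(run=subprocess.check_call):
    """Mount RW once, let the caller make changes, then always remount RO."""
    run([MOUNT_RW])
    try:
        yield
    finally:
        # Runs even if a change raises, so the card is not left RW by accident.
        run([MOUNT_RO])


# Usage (the change functions are placeholders):
#
#     with mounted_rw():
#         make_first_change()
#         make_second_change()
```

That way the slow RO remount is paid once for the whole batch instead of once per change.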
Thanks for the tests & report.
So I felt like upgrading from 2.2.2 to 2.2.4-dev, after I had downgraded from 2.2.3 to 2.2.2 because 2.2.3 was indeed slow, as discussed elsewhere. Now this 2.2.4-dev on RO looks good & snappy again.
built on Sat Jul 11 02:49:04 CDT 2015
& native IPv6 over PPPoE, DNS Resolver, DHCP(v4), DHCPv6, RA, 2x LAN.
Thanks for the feedback.
After the last of our changes on Friday, we had an APU and an ALIX running power cycles over and over in a loop while permanently mounted rw on the slowest SD/CF I could find. Each went through 1000 power cycles with no issue. IP PDUs are handy. :) Only certain cards seem to corrupt themselves easily in that circumstance. A SanDisk CF didn't break even with SU+J and permanent rw. The Kingston CF I used for the 1000 cycles never made it more than 2-3 power cycles when left rw with SU+J.
The last part on config writing is being tested right now. The basic case, making a config change and immediately pulling the power, was fine last week. The enclosing directory wasn't being fsynced though, which meant if you did something insane like call write_config() in an endless loop at startup and drop the power right in the middle of that loop, you could be left with a wiped out /cf/conf/backup/, because the loop overwrote the entire history in 1-2 seconds and that wasn't fsynced. If you hit the exact right (or wrong) moment, you could end up with a missing config and no backups to restore. That's about a 1 in 200-300 tries occurrence when you're writing the config in a non-stop loop as power is lost, and probably impossible to encounter in any reasonable real world circumstance. But even that scenario should be fine now. Two systems are running in a power cycle loop right now to confirm that. Leaving that running all night.
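For anyone wondering why fsyncing the enclosing directory matters: flushing a file makes its contents durable, but the directory entry that names it (or a rename over an old copy) is separate metadata and can still be lost on power failure. The sketch below shows the general pattern in Python on a Unix-like system. It is not the actual write_config() code, and `durable_write` is a hypothetical name used only for illustration.

```python
import os
import tempfile


def durable_write(path, data):
    """Write bytes to path so the file contents and its directory entry are both flushed."""
    directory = os.path.dirname(os.path.abspath(path))
    # Temp file in the same directory, so the rename stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # file contents reach the disk
        os.replace(tmp_path, path)   # atomic rename over the old file
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Without this step the rename may exist only in memory when power is lost.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)             # directory entry reaches the disk
    finally:
        os.close(dir_fd)
```

With the directory fsync in place, a power cut leaves either the old file or the new one under that name, which is the property the looped write_config() test is probing.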