I've mentioned some rc.d problems in other thread but decided to create new one as I've debugged problem a little more. Feel free to merge threads if you think they should be merged :-)
The bugs are…
1. Function restart_packages() from rc.newwanip doesn't do restart properly
2. Packages aren't stopped on system reboot/halt
3. Backup router forks lots of php causing OOM error and system failure
Now... rc.newwanip has function restart_packages which calls rc.start_packages (in background). Restarting packages should first stop them waiting for processes to finish and then start them again. Unfortunately calling "stop" was removed from rc.start_packages long long time ago and placed in rc.stop_packages. Its fine, but rc.stop_packages is never called or grep is cheating me :-) Because of that few thing doesn't work as they should.
1. User scripts may missbehave, ie. I've placed in /usr/localt/etc/rc.d simple script that launches tcpdump into background to log some traffic. Script has proper handling of start, stop and restart commands. After reboot I have lots and lots instances of tcpdump running because stop wasn't called before start when exectued from restart_packages().
2. System services aren't restarted. Calling just "start" to restart service won't work if it checks if its already running. Thus no restart ever occurs for such services.
3. Services aren't properly stopped on system reboot/halt. That means processes are simply killed at some point. This may lead to data corruption depending on services used and how they handle signals.
4. Any service related operation (start/stop/restart) shoudn't be lanuched into background. If there are many interfaces used in system rc.start_packages is called in background many times. Even with proper implementation of start/stop it won't work because script processes will interfere with each other (one stopping, second starting, third stopping and so on).
Here is little patch that fixed bugs #1 and #2: http://pastebin.com/ARVfDvLs
Finally bug #3 about which I wrote in other thread, but back then I didn't knew reason of such behavior. Now after little research I know exactly what happens. My setup has total of 51 network interfaces reported by ifconfig. After each click on "Apply changes" in any part of pfSense web configurator on my master router following things happen on backup router:
1. /usr/local/sbin/check_reload_status is run
2. for each network interface /etc/rc.newwanip is called, by now we have 51 php processes spawned
3. each copy of /etc/rc.newwanip calls /etc/rc.start_packages shell script, we have 51 of those too but they quickly disappear because…
4. each copy of /etc/rc.start_packages executes each *.sh file from /usr/local/etc/rc.d in background so each *.sh script is spawned 51 times almost simultaneously
Lets say I'm working on adding aliases, rules, configuring VPNs etc. and lets say I'll click "Apply changes" 5 times in 5 minutes. I'll have 255 php processes which will launch each script from /usr/local/etc/rc.d 255 times. Usually thats enough to hog C2D CPU, eat 4 gigs of ram, 8 gigs of swap effectively killing machine.
I'm wondering if sources of check_reload_status are available somewhere? In worst case (no fix in mainstream pfSense) I'll be able to fix this problem on my own.
The source for the check_reload_status binary is in the pfsense-tools repo.
Usually rc.newwanip only gets run on interfaces that have a gateway assigned, though I can see how that would have scaling issues in the case you are seeing. Ermal is the one who wrote the check_reload_status program, he'll probably have some ideas on a fix for this. He's traveling right now (returning from BSDCan), I'll open a ticket so this doesn't get lost.
Thanks for digging into the code and such a detailed report. It helps quite a bit.